Abstract

The edit distance between two words $w_1, w_2$ is the minimal number of word operations (letter insertions,
deletions, and substitutions) necessary to transform $w_1$ to $w_2$. The edit distance generalizes to languages
$\mathcal{L}_1, \mathcal{L}_2$, where the edit distance from $\mathcal{L}_1$ to $\mathcal{L}_2$ is the minimal number $k$ such that for every word
from $\mathcal{L}_1$ there exists a word in $\mathcal{L}_2$ with edit distance at most $k$. We study the edit distance computation
problem between pushdown automata and their subclasses. The problem of computing edit distance to a pushdown
automaton is undecidable, and in practice, the interesting question is to compute the edit distance from a pushdown
automaton (the implementation, a standard model for programs with recursion) to a regular language (the
specification). In this work, we present a complete picture of decidability and complexity for the following
problems: (1) deciding whether, for a given threshold $k$, the edit distance from a pushdown automaton to a finite
automaton is at most $k$, and (2) deciding whether the edit distance from a pushdown automaton to a finite
automaton is finite.


1 Introduction 

Edit distance. The edit distance fl 6 l between two words is a well-studied metric, which is the minimum number of 
edit operations (insertion, deletion, or substitution of one letter by another) that transforms one word to another. The 
edit distance from a word $w$ to a language $\mathcal{L}$ is the minimal edit distance between $w$ and words in $\mathcal{L}$. The edit
distance between two languages $\mathcal{L}_1$ and $\mathcal{L}_2$ is the supremum over all words $w$ in $\mathcal{L}_1$ of the edit distance
from $w$ to $\mathcal{L}_2$.

Significance of edit distance. The notion of edit distance provides a quantitative measure of "how far apart" are
(a) two words, (b) words from a language, and (c) two languages. It forms the basis for quantitatively comparing
sequences, a problem that arises in many different areas, such as error-correcting codes, natural language processing,
and computational biology. The notion of edit distance between languages forms the foundations of a quantitative
approach to verification. The traditional qualitative verification (model checking) question is the language inclusion
problem: given an implementation (source language) defined by an automaton $\mathcal{A}_I$ and a specification (target language)
defined by an automaton $\mathcal{A}_S$, decide whether the language $\mathcal{L}(\mathcal{A}_I)$ is included in the language $\mathcal{L}(\mathcal{A}_S)$ (i.e.,
$\mathcal{L}(\mathcal{A}_I) \subseteq \mathcal{L}(\mathcal{A}_S)$). The threshold edit distance (TED) problem is a generalization of the language inclusion
problem, which for a given integer threshold $k \geq 0$ asks whether every word in the source language $\mathcal{L}(\mathcal{A}_I)$ has
edit distance at most $k$ to the target language $\mathcal{L}(\mathcal{A}_S)$ (with $k = 0$ we have the traditional language inclusion
problem). For example, in
simulation-based verification of an implementation against a specification, the measured trace may differ slightly from 
the specification due to inaccuracies in the implementation. Thus, a trace of the implementation may not be in the 
specification. However, instead of rejecting the implementation, one can quantify the distance between a measured 
(RiSE), ERC Start grant (279307: Graph Games), MSR faculty fellows award, and by the National Science Centre (NCN), Poland under grant
2014/15/D/ST6/04543. 
Table 1: Complexity of the language inclusion problem from $\mathcal{C}_1$ to $\mathcal{C}_2$. Our results are boldfaced.


                       C_2 = DFA                 C_2 = NFA       C_2 = DPDA         C_2 = PDA

C_1 ∈ {DFA, NFA}       coNP-c [3]                PSpace-c [3]    open (Conj. 35)    undecidable

C_1 ∈ {DPDA, PDA}      coNP-complete (Th. 22)    ExpTime-c       undecidable        undecidable

Table 2: Complexity of FED($\mathcal{C}_1$, $\mathcal{C}_2$). Our results are boldfaced.

Table 3: Complexity of TED($\mathcal{C}_1$, $\mathcal{C}_2$). Our results are boldfaced.


trace and the specification. Among all implementations that violate a specification, the closer the implementation traces 
are to the specification, the better Il6lf8l fl3l . The edit distance problem is also the basis for repairing specifications m 

0 . 

The TED problem answers a fine-grained question with a fixed bound on the number of edit operations. A related
problem, the finite edit distance (FED) problem, asks whether there exists $k \geq 0$ such that the answer to the TED
problem with threshold $k$ is YES. Hence, in verification applications we ask the FED question first, and in case of a
positive answer, we can ask the TED question.

Our models. In this work we consider the edit distance computation problem between two automata $\mathcal{A}_1$ and $\mathcal{A}_2$,
where $\mathcal{A}_1$ and $\mathcal{A}_2$ can be (non-)deterministic finite automata or pushdown automata. Pushdown automata are the
standard models for programs with recursion, and regular languages are canonical to express the basic properties of 
systems that arise in verification. We denote by DPDA (resp., PDA) deterministic (resp., non-deterministic) pushdown 
automata, and DFA (resp., NFA) deterministic (resp., non-deterministic) finite automata. We consider source and target 
languages defined by DFA, NFA, DPDA, and PDA. We first present the known results and then our contributions. 
Previous results. The main results for the classical language inclusion problem are as follows HU: (i) if the target 
language is a DFA, then it can be solved in polynomial time; (ii) if either the target language is a PDA or both source 
and target languages are DPDA, then it is undecidable; (iii) if the target language is an NFA, then (a) if the source 
language is a DFA or NFA, then it is PSpace-complete, and (b) if the source language is a DPDA or PDA, then it is 
PSpace-hard and can be solved in ExpTime (to the best of our knowledge, there is a complexity gap where the upper 
bound is ExpTime and the lower bound is PSpace). The TED and FED problems were studied for DFA and NFA. 
The TED problem is PSpace-complete, when the source and target languages are given by DFA or NFA f2]0. When 
the source language is given by a DFA or NFA, the FED problem is: (i) CoNP-complete, when the target language is 
given by a DFA 0, (ii) PSpace-complete, when the target language is given by an NFA 0. 

Our contributions. Our main contributions are as follows. 

1. We show that the TED problem is ExpTime-complete, when the source language is given by a DPDA or a 
PDA, and the target language is given by a DFA or NFA. We present a hardness result which shows that the 
TED problem is ExpTime-hard for source languages given as DPDA and target languages given as DFA. We 
present a matching upper bound by showing that for source languages given as PDA and target languages 
given as NFA the problem can be solved in ExpTime. As a consequence of our lower bound we obtain that 
the language inclusion problem for source languages given by DPDA (or PDA) and target languages given by 
NFA is ExpTime-complete. In contrast, if the target language is given by a DPDA, then the TED problem is 
undecidable even for source languages given as DFA. Thus we present a complete picture of the complexity of 
the TED problem, and in addition we close a complexity gap in the classical language inclusion problem. Note 
that the interesting verification question is when the implementation (source language) is a DPDA (or PDA) and 
the specification (target language) is given as a DFA (or NFA), for which we present decidability results with
optimal complexity. 

2. We also study the FED problem. For finite automata, it was shown in [2, 3] that if the answer to the FED
problem is YES, then a polynomial bound on $k$ exists. In contrast, the edit distance can be exponential between
DPDA and DFA. We present a matching exponential upper bound on $k$ for the FED problem from PDA to NFA.
We show that when source languages are given as DPDA or PDA, the FED problem is: (i) coNP-complete, if
the target languages are given as DFA, and (ii) ExpTime-complete, if the target languages are given as NFA.
The lower bound in (i) holds even for source languages given as DFA [3]. Our results are summarized in
Tables 1, 2 and 3.

This paper extends [7] in the following two ways:

• We provide full proofs of all results from [7].

• We show that the FED problem is coNP-complete if the source language is given by a DPDA or PDA and the
target language is given by a DFA. This result is technically involved, but it completes the complexity picture for
the FED problem in the case where the source language is given by a pushdown automaton and the target language
by a finite automaton.

Related work. Algorithms for edit distance have been studied extensively for words (IS[I][Ell|20lQ3JED. The edit 
distance between regular languages was studied in (2] [3|, between timed automata in |9|. and between straight line 
programs in mm. A near-linear time algorithm to approximate the edit distance for a word to a Dyck language 
has been presented in OTI . 

2 Preliminaries 

2.1 Words, languages and automata 

Words. Given a finite alphabet $\Sigma$ of letters, a word $w$ is a finite sequence of letters. For a word $w$, we define $w[i]$ as
the $i$-th letter of $w$ and $|w|$ as its length. For instance, if $w = abc$, then $w[2] = b$ and $|w| = 3$. We denote the set
of all words over $\Sigma$ by $\Sigma^*$. We use $\epsilon$ to denote the empty word.

Pushdown automata. A (non-deterministic) pushdown automaton (PDA) is a tuple $(\Sigma, \Gamma, Q, S, \delta, F)$, where $\Sigma$ is the
input alphabet, $\Gamma$ is a finite stack alphabet, $Q$ is a finite set of states, $S \subseteq Q$ is a set of initial states,
$\delta \subseteq Q \times \Sigma \times (\Gamma \cup \{\bot\}) \times Q \times \Gamma^*$ is a finite transition relation and $F \subseteq Q$ is a set of final (accepting)
states. A PDA $(\Sigma, \Gamma, Q, S, \delta, F)$ is a deterministic pushdown automaton (DPDA) if $|S| = 1$ and $\delta$ is a function from
$Q \times \Sigma \times (\Gamma \cup \{\bot\})$ to $Q \times \Gamma^*$. We denote the class of all PDA (resp., DPDA) by PDA (resp., DPDA). We define
the size of a PDA $\mathcal{A} = (\Sigma, \Gamma, Q, S, \delta, F)$, denoted by $|\mathcal{A}|$, as $|Q| + |\delta|$.

Runs of pushdown automata. Given a PDA $\mathcal{A}$ and a word $w = w[1] \cdots w[k]$ over $\Sigma$, a run $\pi$ of $\mathcal{A}$ on $w$ is a
sequence of elements from $Q \times \Gamma^*$ of length $k + 1$ such that $\pi[0] \in S \times \{\epsilon\}$ and for every $i \in \{1, \ldots, k\}$ either
(1) $\pi[i-1] = (q, \epsilon)$, $\pi[i] = (q', u')$ and $(q, w[i], \bot, q', u') \in \delta$, or (2) $\pi[i-1] = (q, ua)$, $\pi[i] = (q', uu')$ and
$(q, w[i], a, q', u') \in \delta$. A run $\pi$ of length $k + 1$ is accepting if $\pi[k] \in F \times \{\epsilon\}$, i.e., the automaton is in an accepting
state and the stack is empty. The language recognized (or accepted) by $\mathcal{A}$, denoted $\mathcal{L}(\mathcal{A})$, is the set of words that have
an accepting run.

Context free grammar (CFG). A context free grammar (CFG) is a tuple $(\Sigma, V, S, P)$, where $\Sigma$ is the alphabet, $V$
is a set of non-terminals, $S \in V$ is a start symbol and $P$ is a set of production rules. A production rule $p$ has the
following form $p : A \to u$, where $A \in V$ and $u \in (\Sigma \cup V)^*$.

A CFG in Chomsky normal form (CNF) is the special case in which each production rule $p$ has one of the following
forms (recall that $S$ is the start symbol): (1) $p : A \to BC$, where $A \in V$ and $B, C \in V \setminus \{S\}$; or (2) $p : A \to a$,
where $A \in V$ and $a \in \Sigma$; or (3) $p : S \to \epsilon$. It is well-known that any CFG can be brought into CNF in polynomial
time [14].

Languages generated by CFGs. Fix a CFG $G = (\Sigma, V, S, P)$. We define derivation $\to_G$ as a relation on
$(\Sigma \cup V)^* \times (\Sigma \cup V)^*$ as follows: $w \to_G w'$ iff $w = w_1 A w_2$, with $A \in V$, and $w' = w_1 u w_2$ for some $u \in (\Sigma \cup V)^*$
such that $A \to u$ is a production from $G$. We define $\to^*_G$ as the transitive closure of $\to_G$. The language generated by
$G$, denoted by $\mathcal{L}(G) = \{w \in \Sigma^* \mid S \to^*_G w\}$, is the set of words that can be derived from $S$. We omit $G$ and write
$\to^*$ for $\to^*_G$ if $G$ is clear from the context, and for any non-terminal $A$ and word $w \in (\Sigma \cup V)^*$, we call $A \to^* w$ an
implied production rule. For instance, the CFG $G = (\Sigma, V, S, P)$, where $\Sigma = \{a, b\}$, $V = \{S\}$, and the rules $P$ are
$S \to aSb$ and $S \to ab$, generates the language $\{a^n b^n \mid n \geq 1\}$.

Figure 1: Example of a derivation tree of $w = aabb$ for the CFG $G$ given in paragraph "Languages generated by
CFGs."
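The example grammar with rules $S \to aSb$ and $S \to ab$ can be explored with a short program. The following is a minimal sketch, not part of the paper: the function name `generate`, the dictionary encoding of the rules, and the length bound `max_len` are assumptions made for illustration. It expands the leftmost non-terminal breadth-first and prunes sentential forms with too many terminals, so it is only meant for small grammars such as this one.

```python
from collections import deque

def generate(rules, start, max_len):
    """Enumerate terminal words of length <= max_len derivable from `start`.

    `rules` maps each non-terminal (a single character) to a list of
    right-hand sides (strings); every other character is a terminal.
    """
    words, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost non-terminal, if any.
        idx = next((k for k, c in enumerate(form) if c in rules), None)
        if idx is None:
            words.add(form)  # only terminals left: a derived word
            continue
        for rhs in rules[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            terminals = sum(c not in rules for c in new)
            # Terminals never disappear, so longer forms can be pruned.
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(words, key=len)

print(generate({"S": ["aSb", "ab"]}, "S", 6))
```

For the bound 6 the sketch lists the words $ab$, $aabb$ and $aaabbb$, in line with the language $\{a^n b^n \mid n \geq 1\}$.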

It is well-known [14] that CFGs and PDAs are language-wise polynomially equivalent (i.e., there is a polynomial
time procedure that, given a PDA, outputs a CFG of the same language and vice versa).

Derivation trees of CFGs. Fix a CFG $G = (\Sigma, V, S, P)$. The CFG defines a (typically infinite) set of derivation
trees. A derivation tree is an ordered tree$^1$ where (1) each leaf is associated with an element of $\Sigma \cup V \cup \{\epsilon\}$; and
(2) each internal node is associated with a non-terminal $A \in V$ and a production rule $p : A \to w$, such that the node has $|w|$
children and the $i$-th child, for each $i$, is associated with $w[i]$ if it is a leaf or a production rule $p' : w[i] \to w'$ if it is
an internal node. A derivation tree $T$ defines a string $w(T)$ over $\Sigma \cup V$ formed by reading labels of the leaves of $T$ in
an ascending lexicographic path order ("from left to right") while skipping $\epsilon$ symbols. Existence of a derivation tree
$T$ with the root $A$ certifies that $A \to^*_G w(T)$. For instance, given $G$ (as in the previous paragraph), the derivation tree
for $aabb$ is as given in Figure 1.

Finite automata. A non-deterministic finite automaton (NFA) is a pushdown automaton with empty stack alphabet.
We will omit $\Gamma$ while referring to NFA, i.e., we will consider them as tuples $(\Sigma, Q, S, \delta, F)$. We denote the class of
all NFA by NFA. Analogously to DPDA we define deterministic finite automata (DFA).

Language inclusion. Let $\mathcal{C}_1, \mathcal{C}_2$ be subclasses of PDA. The inclusion problem from $\mathcal{C}_1$ in $\mathcal{C}_2$ asks, given $\mathcal{A}_1 \in \mathcal{C}_1$,
$\mathcal{A}_2 \in \mathcal{C}_2$, whether $\mathcal{L}(\mathcal{A}_1) \subseteq \mathcal{L}(\mathcal{A}_2)$.

Single letter operations on words. A single letter operation on a word can be either an insertion, a deletion, or a
substitution. Given a letter $a \in \Sigma$ and a number $i$ we define relations $\to_{I(a,i)}, \to_{D(a,i)}, \to_{S(a,i)} \subseteq \Sigma^* \times \Sigma^*$ as follows:

• the insert relation $\to_{I(a,i)}$: for all $w, w'$ we have $w \to_{I(a,i)} w'$ iff $w' = w[1] \ldots w[i]\, a\, w[i+1] \ldots w[|w|]$.
For example, $abc \to_{I(a,2)} abac$.

• the delete relation $\to_{D(a,i)}$: for all $w, w'$ we have $w \to_{D(a,i)} w'$ iff $w' = w[1] \ldots w[i-1]\, w[i+1] \ldots w[|w|]$.
For example, $abc \to_{D(b,2)} ac$. (Note that we ignore the letter parameter for deletions. We use $\to_{D(a,i)}$ over a
notation like $\to_{D(i)}$ to ensure that all three types of single letter operations have 2 parameters.)

• the substitution relation $\to_{S(a,i)}$: for all $w, w'$ we have $w \to_{S(a,i)} w'$ iff $w' = w[1] \ldots w[i-1]\, a\, w[i+1] \ldots w[|w|]$.
For example, $abc \to_{S(a,2)} aac$.
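The three relations translate directly into string operations. The sketch below is illustrative only; the function names `insert`, `delete` and `substitute` are not from the paper, and positions are 1-indexed as in the definitions above.

```python
def insert(w: str, a: str, i: int) -> str:
    """w ->_{I(a,i)} w': place letter a right after the i-th letter of w."""
    if not 0 <= i <= len(w):
        raise ValueError("position out of range")
    return w[:i] + a + w[i:]

def delete(w: str, i: int) -> str:
    """w ->_{D(a,i)} w': drop the i-th letter (the letter parameter is ignored)."""
    if not 1 <= i <= len(w):
        raise ValueError("position out of range")
    return w[:i - 1] + w[i:]

def substitute(w: str, a: str, i: int) -> str:
    """w ->_{S(a,i)} w': replace the i-th letter of w by a."""
    if not 1 <= i <= len(w):
        raise ValueError("position out of range")
    return w[:i - 1] + a + w[i:]

# The three examples from the text.
print(insert("abc", "a", 2), delete("abc", 2), substitute("abc", "a", 2))
```

The three calls reproduce the examples $abc \to abac$, $abc \to ac$ and $abc \to aac$.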

Edit distance between words. Given two words $w_1, w_2$, the edit distance between $w_1, w_2$, denoted by $ed(w_1, w_2)$,
is the minimal number of single letter operations (insertions, deletions, and substitutions) necessary to transform $w_1$
into $w_2$. More formally, $k := ed(w_1, w_2)$ is the length of the shortest sequence $S_1 S_2 \cdots S_k$, where each $S_j$ is an
operation $S_j = (P_j, a_j, i_j) \in \{I, D, S\} \times \Sigma \times \mathbb{N}$ for each $j$, such that there exist words $s_i$, $i \in \{0, \ldots, k\}$, for which
(1) $w_1 = s_0$, (2) $w_2 = s_k$ and (3) $s_{j-1} \to_{P_j(a_j, i_j)} s_j$ for all $j \in \{1, \ldots, k\}$.
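For two concrete words this quantity can be computed with the classical dynamic program over prefixes. The code below is a standard textbook sketch and is not taken from the paper; the name `edit_distance` is chosen here for illustration.

```python
def edit_distance(w1: str, w2: str) -> int:
    """Minimal number of insertions, deletions and substitutions turning w1 into w2."""
    # prev[j] = distance between the current prefix of w1 and w2[:j].
    prev = list(range(len(w2) + 1))
    for i, a in enumerate(w1, start=1):
        cur = [i]  # deleting all i letters yields the empty prefix of w2
        for j, b in enumerate(w2, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a
                cur[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),  # keep a, or substitute a by b
            ))
        prev = cur
    return prev[-1]

print(edit_distance("abc", "abac"))
```

The table has $(|w_1| + 1)(|w_2| + 1)$ entries, so the running time is quadratic in the lengths of the words.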

1 In an ordered tree, children of every node are ordered. 
Edit distance between languages. Let $\mathcal{L}_1, \mathcal{L}_2$ be languages. We define the edit distance from $\mathcal{L}_1$ to $\mathcal{L}_2$, denoted
$ed(\mathcal{L}_1, \mathcal{L}_2)$, as $\sup_{w_1 \in \mathcal{L}_1} \inf_{w_2 \in \mathcal{L}_2} ed(w_1, w_2)$. The edit distance between languages is not a distance function. In
particular, it is not symmetric. For example: $ed(\{a\}^*, \{a, b\}^*) = 0$, while $ed(\{a, b\}^*, \{a\}^*) = \infty$ because for every
$n$, we have $ed(\{b^n\}, \{a\}^*) = n$.
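The asymmetry can be checked on sample words by computing the inner infimum, that is, the edit distance from a single word to a regular language. The sketch below is an illustration under assumptions that are not from the paper: the NFA is given as a set of triples `(q, letter, q2)` with integer states, and the function name `word_to_language_distance` is hypothetical. It runs Dijkstra's algorithm on pairs (number of letters of $w$ consumed, NFA state).

```python
import heapq

def word_to_language_distance(w, delta, initial, accepting):
    """Edit distance from word w to the language of an NFA (inf if it is empty)."""
    succ = {}
    for q, b, q2 in delta:
        succ.setdefault(q, []).append((b, q2))
    dist = {}
    heap = [(0, 0, q) for q in initial]
    heapq.heapify(heap)
    while heap:
        d, i, q = heapq.heappop(heap)
        if (i, q) in dist:
            continue
        dist[(i, q)] = d
        if i < len(w):
            heapq.heappush(heap, (d + 1, i + 1, q))  # delete w[i]
        for b, q2 in succ.get(q, []):
            heapq.heappush(heap, (d + 1, i, q2))     # insert b
            if i < len(w):
                # read w[i] unchanged (cost 0) or substitute it by b (cost 1)
                heapq.heappush(heap, (d + (b != w[i]), i + 1, q2))
    return min((d for (i, q), d in dist.items()
                if i == len(w) and q in accepting), default=float("inf"))

a_star = {(0, "a", 0)}                # NFA for {a}*
ab_star = {(0, "a", 0), (0, "b", 0)}  # NFA for {a, b}*
for n in range(4):
    print(n, word_to_language_distance("b" * n, a_star, [0], {0}),
          word_to_language_distance("a" * n, ab_star, [0], {0}))
```

On these samples the distance from $b^n$ to $\{a\}^*$ grows with $n$, while every word $a^n$ has distance 0 to $\{a, b\}^*$, which matches the example in the text.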

2.2 Problem statement 

In this section we define the problems of interest. Then, we recall the previous results and succinctly state our results. 

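Definition 1. For $\mathcal{C}_1, \mathcal{C}_2 \in \{\text{DFA}, \text{NFA}, \text{DPDA}, \text{PDA}\}$ we define the following questions:

1. The threshold edit distance problem from $\mathcal{C}_1$ to $\mathcal{C}_2$ (denoted TED($\mathcal{C}_1$, $\mathcal{C}_2$)): Given automata $\mathcal{A}_1 \in \mathcal{C}_1$, $\mathcal{A}_2 \in \mathcal{C}_2$
and an integer threshold $k \geq 0$, decide whether $ed(\mathcal{L}(\mathcal{A}_1), \mathcal{L}(\mathcal{A}_2)) \leq k$.

2. The finite edit distance problem from $\mathcal{C}_1$ to $\mathcal{C}_2$ (denoted FED($\mathcal{C}_1$, $\mathcal{C}_2$)): Given automata $\mathcal{A}_1 \in \mathcal{C}_1$, $\mathcal{A}_2 \in \mathcal{C}_2$,
decide whether $ed(\mathcal{L}(\mathcal{A}_1), \mathcal{L}(\mathcal{A}_2)) < \infty$.

3. Computation of edit distance from $\mathcal{C}_1$ to $\mathcal{C}_2$: Given automata $\mathcal{A}_1 \in \mathcal{C}_1$, $\mathcal{A}_2 \in \mathcal{C}_2$, compute $ed(\mathcal{L}(\mathcal{A}_1), \mathcal{L}(\mathcal{A}_2))$.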

We establish the complete complexity picture for the TED problem for all combinations of source and target
languages given by DFA, NFA, DPDA and PDA:

1. TED for regular languages has been studied in [2], where PSpace-completeness of TED($\mathcal{C}_1$, $\mathcal{C}_2$) for $\mathcal{C}_1, \mathcal{C}_2 \in$
{DFA, NFA} has been established.

2. In Section 3, we study the TED problem for source languages given by pushdown automata and target languages
given by finite automata. We establish ExpTime-completeness of TED($\mathcal{C}_1$, $\mathcal{C}_2$) for $\mathcal{C}_1 \in$ {DPDA, PDA} and
$\mathcal{C}_2 \in$ {DFA, NFA}.

3. In Section 5, we study the TED problem for target languages given by pushdown automata. We show that
TED($\mathcal{C}_1$, $\mathcal{C}_2$) is undecidable for $\mathcal{C}_1 \in$ {DFA, NFA, DPDA, PDA} and $\mathcal{C}_2 \in$ {DPDA, PDA}.

We study the FED problem for all combinations of source and target languages given by DFA, NFA, DPDA and
PDA and obtain the following results:

1. FED for regular languages has been studied in [3]. It has been shown that for $\mathcal{C}_1 \in$ {DFA, NFA}, the problem
FED($\mathcal{C}_1$, DFA) is coNP-complete, while the problem FED($\mathcal{C}_1$, NFA) is PSpace-complete.

2. We show in Section 4 that for $\mathcal{C}_1 \in$ {DPDA, PDA}, the problem FED($\mathcal{C}_1$, NFA) is ExpTime-complete and the
problem FED($\mathcal{C}_1$, DFA) is coNP-complete.

3. We show in Section 5 that (1) for $\mathcal{C}_1 \in$ {DFA, NFA, DPDA, PDA}, the problem FED($\mathcal{C}_1$, PDA) is undecidable,
and (2) the problem FED(DPDA, DPDA) is undecidable.

Remark 2. Weighted edit-distance. One could also consider a notion of weighted edit-distance, where a weight
function $f : \{I, D, S\} \times \Sigma \to \mathbb{Z}$ is given that assigns a weight to each edit operation and letter. That is, inserting
a letter $a$ might have a different weight from inserting a letter $b$. The weighted edit-distance $wed(w_1, w_2)$ would
then be the minimum sum of weights $\sum_{j=1}^{k} f(P_j, a_j)$ over any $k$ and sequence of edit operations $S_1 \ldots S_k$, where
$S_j = (P_j, a_j, i_j) \in \{I, D, S\} \times \Sigma \times \mathbb{N}$ for each $j$, such that there exist words $s_i$, $i \in \{0, \ldots, k\}$, for which
(1) $w_1 = s_0$, (2) $w_2 = s_k$ and (3) $s_{j-1} \to_{P_j(a_j, i_j)} s_j$ for all $j \in \{1, \ldots, k\}$.

Our results extend to the case where $f$ assigns positive weights. There are naturally no differences for the FED
case (since if the minimum length is infinite, then so too is the sum of weights). There are no differences for the
TED case either: the only place where the weights come up is Claim 5 below, and there the argument is unchanged.

Allowing $f$ to assign zero or infinite weights leads to distances very different from the classical edit distance, such
as the Hamming distance, or the length difference. Such distances are out of scope of this paper.


3 Threshold edit distance from pushdown to regular languages 

In this section we establish the complexity of the TED problem from pushdown to finite automata. 

Theorem 3. (1) For $\mathcal{C}_1 \in$ {DPDA, PDA} and $\mathcal{C}_2 \in$ {DFA, NFA}, the TED($\mathcal{C}_1$, $\mathcal{C}_2$) problem is ExpTime-complete.
(2) For $\mathcal{C}_1 \in$ {DPDA, PDA}, the language inclusion problem from $\mathcal{C}_1$ in NFA is ExpTime-complete.

We establish the above theorem as follows: In Section 3.1, we present an exponential-time algorithm for
TED(PDA, NFA) (for the upper bound of (1)). Then, in Section 3.2, we show (2), in a slightly stronger form, and
reduce it (that stronger problem) to TED(DPDA, DFA), which shows the ExpTime-hardness part of (1). We
conclude this section with a brief discussion on parametrized complexity of TED in Section 3.3.

3.1 Upper bound 

We present an ExpTime algorithm that, given (1) a PDA $\mathcal{A}_P$; (2) an NFA $\mathcal{A}_N$; and (3) a threshold $t$ given in binary,
decides whether the edit distance from $\mathcal{A}_P$ to $\mathcal{A}_N$ is above $t$. The algorithm extends a construction for NFA by
Benedikt et al. [2].

Intuition. The construction uses the idea that for a given word $w$ and an NFA $\mathcal{A}_N$ the following are equivalent:
(i) $ed(w, \mathcal{A}_N) > t$, and (ii) for each accepting state $s$ of $\mathcal{A}_N$ and for every word $w'$, if $\mathcal{A}_N$ can reach $s$ from some
initial state upon reading $w'$, then $ed(w, w') > t$. We construct a PDA $\mathcal{A}_I$ which simulates the PDA $\mathcal{A}_P$ and stores
in its states all states of the NFA $\mathcal{A}_N$ reachable with at most $t$ edits. More precisely, the PDA $\mathcal{A}_I$ remembers in its
states, for every state $s$ of the NFA $\mathcal{A}_N$, the minimal number of edit operations necessary to transform the currently
read prefix $w_p$ of the input word into a word $w'_p$, upon which $\mathcal{A}_N$ can reach $s$ from some initial state. If for some state
the number of edit operations exceeds $t$, then we associate with this state a special symbol # to denote this. Then, we
show that a word $w$ accepted by the PDA $\mathcal{A}_P$ has $ed(w, \mathcal{A}_N) > t$ iff the automaton $\mathcal{A}_I$ has a run on $w$ that ends (1) in
an accepting state of the simulated $\mathcal{A}_P$, (2) with the simulated stack of $\mathcal{A}_P$ empty, and (3) with the symbol # associated
with every accepting state of $\mathcal{A}_N$.

Lemma 4. Given (1) a PDA $\mathcal{A}_P$; (2) an NFA $\mathcal{A}_N$; and (3) a threshold $t$ given in binary, the decision problem of
whether $ed(\mathcal{A}_P, \mathcal{A}_N) \leq t$ can be reduced to the emptiness problem for a PDA of size $O(|\mathcal{A}_P| \cdot (t + 2)^{|\mathcal{A}_N|})$.

Proof. Let $Q_N$ (resp., $F_N$) be the set of states (resp., accepting states) of $\mathcal{A}_N$. For $i \in \mathbb{N}$ and a word $w$, we define
$T^i_w = \{s \in Q_N :$ there exists $w'$ with $ed(w, w') = i$ such that $\mathcal{A}_N$ has a run on the word $w'$ ending in $s\}$. For a pair
of states $s, s' \in Q_N$ and $a \in \Sigma \cup \{\epsilon\}$, we define $m(s, s', a)$ as the minimum number of edits needed to apply to $a$ so
that $\mathcal{A}_N$ has a run on the resulting word from $s'$ to $s$. For all $s, s' \in Q_N$ and $a \in \Sigma \cup \{\epsilon\}$, we can compute $m(s, s', a)$
in polynomial time in $|\mathcal{A}_N|$. For a state $s \in Q_N$ and a word $w$ let $d^s_w = \min\{i \geq 0 \mid s \in T^i_w\}$, i.e., $d^s_w$ is the minimal
number of edits necessary to apply to $w$ such that $\mathcal{A}_N$ reaches $s$ upon reading the resulting word. We will first prove
the following claim.

Claim 5. We have that $d^s_{wa} = \min_{s' \in Q_N} (d^{s'}_w + m(s, s', a))$.

Proof. Consider a run witnessing $d^s_{wa}$. As shown by [22] we can split the run into two parts, one sub-run on $w$ ending
in $s'$, for some $s'$, and one sub-run on $a$ starting in $s'$. Clearly, the sub-run on $w$ has used $d^{s'}_w$ edits and the one on $a$
has used $m(s, s', a)$ edits. □
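The recurrence of Claim 5 can be tried out on a small NFA. The code below is only a sketch under assumptions that are not from the paper: the NFA is a set of triples `(q, letter, q2)` with integer states, the names `reach_costs` and `d_vector` are hypothetical, and $m(s, s', a)$ is obtained by a shortest-path search rather than by the polynomial-time procedure the proof alludes to. `reach_costs(a, delta, s1)[s]` plays the role of $m(s, s', a)$ with $s' = $ `s1`.

```python
import heapq

INF = float("inf")

def reach_costs(word, delta, start):
    """Map each state s to the least number of edits of `word` after which
    the NFA can move from `start` to s (states that are unreachable are omitted)."""
    succ = {}
    for q, b, q2 in delta:
        succ.setdefault(q, []).append((b, q2))
    dist, heap = {}, [(0, 0, start)]
    while heap:
        d, i, q = heapq.heappop(heap)
        if (i, q) in dist:
            continue
        dist[(i, q)] = d
        if i < len(word):
            heapq.heappush(heap, (d + 1, i + 1, q))  # delete word[i]
        for b, q2 in succ.get(q, []):
            heapq.heappush(heap, (d + 1, i, q2))     # insert b
            if i < len(word):
                heapq.heappush(heap, (d + (b != word[i]), i + 1, q2))
    return {q: d for (i, q), d in dist.items() if i == len(word)}

def d_vector(word, delta, states, initial):
    """Compute d^s_w for all states s, letter by letter, via Claim 5."""
    base = [reach_costs("", delta, s0) for s0 in initial]
    d = {s: min((c.get(s, INF) for c in base), default=INF) for s in states}
    for a in word:
        m = {s1: reach_costs(a, delta, s1) for s1 in states}
        d = {s: min(d[s1] + m[s1].get(s, INF) for s1 in states) for s in states}
    return d

# NFA for (ab)*: state 0 is initial and accepting.
delta = {(0, "a", 1), (1, "b", 0)}
print(d_vector("aa", delta, [0, 1], [0]))
```

For the word $aa$ both entries of the vector equal 1: substituting the second letter gives $ab$, which leads back to state 0, and deleting one letter gives $a$, which leads to state 1.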

Let $Q_P$ (resp., $F_P$) be the set of states (resp., accepting states) of the PDA $\mathcal{A}_P$. For every word $w$ and every
state $q \in Q_P$ such that there is a run on $w$ ending in $q$, we define $\mathrm{Impact}(w, q, \mathcal{A}_P, \mathcal{A}_N, t)$ as a pair $(q, \lambda)$ in
$Q_P \times \{0, 1, \ldots, t, \#\}^{|Q_N|}$, where $\lambda$ is defined as follows: for every $s \in Q_N$ we have $\lambda(s) = d^s_w$ if $d^s_w \leq t$, and
$\lambda(s) = \#$ otherwise. Clearly, the edit distance from $\mathcal{A}_P$ to $\mathcal{A}_N$ exceeds $t$ if there is a word $w$ and an accepting state
$q$ of $\mathcal{A}_P$ such that $\mathrm{Impact}(w, q, \mathcal{A}_P, \mathcal{A}_N, t)$ is a pair $(q, \lambda)$ and for every $s \in F_N$ we have $\lambda(s) = \#$ (i.e., the word $w$ is
in $\mathcal{L}(\mathcal{A}_P)$ but any run of $\mathcal{A}_N$ ending in $F_N$ has distance exceeding $t$).

We can now construct an impact automaton, a PDA $\mathcal{A}_I$, with state space $Q_P \times \{0, 1, \ldots, t, \#\}^{|Q_N|}$ and the transition
relation defined as follows: a tuple $((q, \lambda_1), a, \gamma, (q', \lambda_2), u)$ is a transition of $\mathcal{A}_I$ iff the following conditions hold:

1. the tuple projected to the first component of its state (i.e., the tuple $(q, a, \gamma, q', u)$) is a transition of $\mathcal{A}_P$, and

2. the second component $\lambda_2$ is computed from $\lambda_1$ using Claim 5, i.e., for every $s \in Q_N$ we have
$\lambda_2(s) = \min_{s' \in Q_N} (\lambda_1(s') + m(s, s', a))$.

The initial states of $\mathcal{A}_I$ are $S_P \times \{\lambda_0\}$, where $S_P$ are the initial states of $\mathcal{A}_P$ and $\lambda_0$ is defined as follows. For every
$s \in Q_N$ we have $\lambda_0(s) = \min_{s' \in S_N} m(s, s', \epsilon)$, where $S_N$ are the initial states of $\mathcal{A}_N$ (i.e., a start state of $\mathcal{A}_I$ is a pair
of a start state of $\mathcal{A}_P$ together with the vector where the entry describing $s$ is the minimum number of edits needed
to get to the state $s$ on the empty word). Also, the accepting states are $\{(q, \lambda) \mid q \in F_P$ and for every $s \in F_N$ we
have $\lambda(s) = \#\}$. Observe that for a run of $\mathcal{A}_I$ on $w$ ending in $(q, \lambda)$, the vector $\mathrm{Impact}(w, q, \mathcal{A}_P, \mathcal{A}_N, t)$ is precisely
$(q, \lambda)$. Thus, the PDA $\mathcal{A}_I$ accepts at least one word iff the edit distance from $\mathcal{A}_P$ to $\mathcal{A}_N$ is above $t$. Since the size of $\mathcal{A}_I$
is $O(|\mathcal{A}_P| \cdot (t + 2)^{|\mathcal{A}_N|})$, we obtain the desired result.

□ 


Lemma 4 implies the following:

Lemma 6. TED(PDA, NFA) is in ExpTime. 

Proof. Let $\mathcal{A}_P$, $\mathcal{A}_N$ and $t$ be an instance of TED(PDA, NFA), where $\mathcal{A}_P$ is a PDA, $\mathcal{A}_N$ is an NFA, and $t$ is a threshold
given in binary. By Lemma 4, we can reduce TED to the emptiness question of a PDA of size $O(|\mathcal{A}_P| \cdot (t + 2)^{|\mathcal{A}_N|})$.
Since $|\mathcal{A}_P| \cdot (t + 2)^{|\mathcal{A}_N|}$ is exponential in $|\mathcal{A}_P| + |\mathcal{A}_N| + t$ and the emptiness problem for PDA can be decided in time
polynomial in their size [14], the result follows.

□ 


3.2 Lower bound 

Our ExpTime-hardness proof of TED(DPDA, DFA) extends the idea from [2] that shows PSpace-hardness of the edit
distance for DFA. The standard proof of PSpace-hardness of the universality problem for NFA [14] is by reduction
from the halting problem of a fixed Turing machine $M$ working on a bounded tape. The Turing machine $M$ is the one
that simulates other Turing machines (such a machine is called universal). The input to that problem is the initial
configuration $C_1$ and the tape is bounded by its size $|C_1|$. In the reduction, the NFA recognizes the language of all
words that do not encode a valid computation of $M$ starting from the initial configuration $C_1$, i.e., it accepts if one
of the following conditions is violated: (1) the given word is a sequence of configurations, (2) the state of the Turing
machine and the adjacent letters follow from transitions of $M$, (3) the first configuration is $C_1$ and (4) the tape's cells
are changed only by $M$, i.e., they do not change values spontaneously. While violation of conditions (1), (2) and
(3) can be checked by a DFA of polynomial size, condition (4) can be encoded by a polynomial-size NFA but not
a polynomial-size DFA. However, to check (4) the automaton has to make only a single non-deterministic choice to
pick a position in the encoding of the computation, which violates (4), i.e., the value at that position is different from
the value $|C_1| + 1$ letters further, which corresponds to the same memory cell in the successive configuration, and the
head of $M$ does not change it. We can transform a non-deterministic automaton $\mathcal{A}_N$ checking (4) into a deterministic
automaton $\mathcal{A}_D$ by encoding such a non-deterministic pick using an external letter. Since we need only one external
symbol, we show that $\mathcal{L}(\mathcal{A}_N) = \Sigma^*$ iff $ed(\Sigma^*, \mathcal{L}(\mathcal{A}_D)) = 1$. This suggests the following definition:

Definition 7. An NFA $\mathcal{A} = (\Sigma, Q, S, \delta, F)$ is nearly-deterministic if $|S| = 1$ and $\delta = \delta_1 \cup \delta_2$, where $\delta_1$ is a function
and in every accepting run the automaton takes a transition from $\delta_2$ exactly once.

Lemma 8. There exists a DPDA $\mathcal{A}_P$ such that the problem, given a nearly-deterministic NFA $\mathcal{A}_N$, decide whether
$\mathcal{L}(\mathcal{A}_P) \subseteq \mathcal{L}(\mathcal{A}_N)$, is ExpTime-hard.

Proof. Consider the linear-space halting problem for a (fixed) alternating Turing machine (ATM) $M$: given an input
word $w$ over an alphabet $\Sigma$, decide whether $M$ halts on $w$ with the tape bounded by $|w|$. There exists an ATM $M_P$
such that the linear-space halting problem for $M_P$ is ExpTime-complete [5]. We show the ExpTime-hardness of the
problem from the lemma statement by reduction from the linear-space halting problem for $M_P$.

Without loss of generality, we assume that existential and universal transitions of $M_P$ alternate. Fix an input of
length $n$. The main idea is to construct the language $L$ of words that encode valid terminating computation trees of
$M_P$ on the given input. Observe that the language $L$ depends on the given input. We encode a single configuration of
$M_P$ as a word of length $n + 1$ of the form $\Sigma^i q \Sigma^{n-i}$, where $q$ is a state of $M_P$. Recall that a computation of an ATM
is a tree, where every node of the tree is a configuration of $M_P$, and it is accepting if every leaf node is an accepting
configuration. We encode computation trees $T$ of $M_P$ by traversing $T$ in pre-order and executing the following: if the
current node has only one successor, then write down the current configuration $C$, terminate it with # and move down
to the successor node in $T$. Otherwise, if the current node has two successors $s, t$ in the tree, then write down in order
(1) the reversed current configuration $C^R$; and (2) the results of traversals on $s$ and $t$, each surrounded by parentheses
( and ), i.e., $C^R (u_s) (u_t)$, where $u_s$ (resp., $u_t$) is the result of the traversal of the sub-tree of $T$ rooted at $s$ (resp., $t$).
Finally, if the current node is a leaf, write down the corresponding configuration and terminate with \$. For example,
consider a computation with the initial configuration $C_1$, from which an existential transition leads to $C_2$, which in
turn has a universal transition to $C_3$ and $C_4$. Such a computation tree is encoded as follows:

$C_1 \# C_2^R (C_3 \ldots \$) (C_4 \ldots \$)$.
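The pre-order encoding can be written as a short recursive function. The sketch below is illustrative only; the class `Node` and the function `encode` are hypothetical names, configurations are treated as opaque strings, and no check is made that the tree is an actual computation of the machine.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    config: str                                   # a configuration, as a plain string
    children: list = field(default_factory=list)  # zero, one or two successors

def encode(node: Node) -> str:
    """Encode a computation tree following the three cases of the traversal."""
    if not node.children:
        # Leaf: the configuration terminated with '$'.
        return node.config + "$"
    if len(node.children) == 1:
        # One successor: the configuration, '#', then the successor.
        return node.config + "#" + encode(node.children[0])
    # Two successors: the reversed configuration, then both sub-trees,
    # each surrounded by parentheses.
    left, right = node.children
    return node.config[::-1] + "(" + encode(left) + ")(" + encode(right) + ")"

# The shape of the example: C1 -> C2 (existential), C2 -> C3 and C4 (universal).
tree = Node("C1", [Node("C2", [Node("C3"), Node("C4")])])
print(encode(tree))
```

With the placeholder strings used here, the configuration of the universal node appears reversed in the output, immediately before the two parenthesized sub-trees.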

We define automata $\mathcal{A}_N$ and $\mathcal{A}_P$ over the alphabet $\Sigma \cup \{\#, \$, (, )\}$. The automaton $\mathcal{A}_N$ is a nearly-deterministic
NFA that recognizes only (but not all) words not encoding valid computation trees of $M_P$. More precisely, $\mathcal{A}_N$ accepts
in four cases: (1) The word does not encode a tree (except that the parentheses may not match as the automaton cannot
check that) of computation as presented above. (2) The initial configuration is different from the one given as the input.
(3) The successive configurations, i.e., those that result from existential transitions or left-branch universal transitions
(like $C_2$ to $C_3$), are not valid. The right-branch universal transitions, which are preceded by the word ")(", are not
checked by $\mathcal{A}_N$. For example, the consistency of the transition $C_2$ to $C_4$ is not checked by $\mathcal{A}_N$. Finally, (4) $\mathcal{A}_N$
accepts words in which at least one final configuration, which is a configuration followed by \$, is not final for $M_P$.
Observe that conditions (1), (2) and (4) can be checked by polynomial-size DFA. Condition (3) can be checked by a
polynomial-size nearly-deterministic NFA, which picks a position in $C_2$, for which the corresponding position in $C_3$ is
faulty (either contains a spontaneous change of the corresponding tape cell or it is not compatible with any transition of
$M_P$). Picking such a position corresponds to taking a transition from $\delta_2$ by a nearly-deterministic NFA. Thus, the automaton
$\mathcal{A}_N$ is a nearly-deterministic NFA, which recognizes the union of the languages of automata recognizing (1)-(4).

Next, we define $\mathcal{A}_P$ as a DPDA that accepts words in which parentheses match and right-branch universal
transitions are consistent, e.g., it checks consistency of a transition from $C_2$ to $C_4$. The automaton $\mathcal{A}_P$ pushes configurations
on even levels of the computation tree (e.g., $C_2^R$), which are reversed, on the stack and pops these configurations from
the stack to compare them with the following configuration in the right sub-tree (e.g., $C_4$). In the example this means
that, while the automaton processes the sub-word $(C_3 \ldots \$)$, it can use its stack to check consistency of universal
transitions in that sub-word. We assumed that $M_P$ does not have consecutive universal transitions. This means that,
for example, $\mathcal{A}_P$ does not need to check the consistency of $C_4$ with its successive configuration. By construction, we
have $L = \mathcal{L}(\mathcal{A}_P) \cap \mathcal{L}(\mathcal{A}_N)^c$ (recall that $L$ is the language of encodings of computations of $M_P$ on the given input)
and $M_P$ halts on the given input if and only if $\mathcal{L}(\mathcal{A}_P) \subseteq \mathcal{L}(\mathcal{A}_N)$ fails. Observe that $\mathcal{A}_P$ is fixed for all inputs, since it
only depends on the fixed Turing machine $M_P$. □

Now, the following lemma, which is (2) of Theorem 3, follows from Lemma 8.

Lemma 9. The language inclusion problem from DPDA to NFA is ExpTime-complete. 

Proof. The ExpTime upper bound is immediate (basically, an exponential determinization of the NFA, followed by 
complementation, product construction with the PDA, and the emptiness check of the product PDA in polynomial 
time in the size of the product). ExpTime-hardness of the problem follows from Lemma 8. □

Now, we show that the inclusion problem of DPDA in nearly-deterministic NFA, which is ExpTime-complete by Lemma 9, reduces to TED(DPDA, DFA). In the reduction, we transform a nearly-deterministic NFA $\mathcal{A}_N$ over the alphabet $\Sigma$ into a DFA $\mathcal{A}_D$ by encoding a single non-deterministic choice by auxiliary letters.

Lemma 10. TED(DPDA, DFA) is ExpTime-hard. 

Proof. To show ExpTime-hardness of TED(DPDA, DFA), we reduce the inclusion problem of DPDA in nearly-deterministic NFA to TED(DPDA, DFA). Consider a DPDA $\mathcal{A}_P$ and a nearly-deterministic NFA $\mathcal{A}_N$ over an alphabet $\Sigma$. Without loss of generality we assume that letters on even positions are $\diamond \in \Sigma$ and that $\diamond$ does not appear on the odd positions. Let $\delta = \delta_1 \cup \delta_2$ be the transition relation of $\mathcal{A}_N$, where $\delta_1$ is a function and along each accepting run $\mathcal{A}_N$ takes exactly one transition from $\delta_2$. We transform the NFA $\mathcal{A}_N$ to a DFA $\mathcal{A}_D$ by extending the alphabet $\Sigma$ with external letters $\{1, \ldots, |\delta_2|\}$. On letters from $\Sigma$, the automaton $\mathcal{A}_D$ takes transitions from $\delta_1$. On a letter $i \in \{1, \ldots, |\delta_2|\}$, the automaton $\mathcal{A}_D$ takes the $i$-th transition from $\delta_2$.

We claim that $\mathcal{L}(\mathcal{A}_P) \subseteq \mathcal{L}(\mathcal{A}_N)$ iff $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_D)) = 1$. Every word $w \in \mathcal{L}(\mathcal{A}_D)$ contains a letter $i \in \{1, \ldots, |\delta_2|\}$, which does not belong to $\Sigma$. Therefore, $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_D)) \ge 1$. But, if we substitute letter $i$ by the letter in the $i$-th transition of $\delta_2$, we get a word from $\mathcal{L}(\mathcal{A}_N)$. If we simply delete the letter $i$, we get a word which does not belong to $\mathcal{L}(\mathcal{A}_N)$, as it has letter $\diamond$ on an odd position. Therefore, $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_D)) \le 1$ implies $\mathcal{L}(\mathcal{A}_P) \subseteq \mathcal{L}(\mathcal{A}_N)$. Finally, consider a word $w' \in \mathcal{L}(\mathcal{A}_N)$. The automaton $\mathcal{A}_N$ has an accepting run on $w'$, which takes a transition from $\delta_2$ exactly once. Say the taken transition is the $i$-th transition and the position in $w'$ is $p$. Then, the word $w$, obtained from $w'$ by substituting the letter at position $p$ by letter $i$, is accepted by $\mathcal{A}_D$. Therefore, $\mathcal{L}(\mathcal{A}_P) \subseteq \mathcal{L}(\mathcal{A}_N)$ implies $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_D)) \le 1$. Thus we have $\mathcal{L}(\mathcal{A}_P) \subseteq \mathcal{L}(\mathcal{A}_N)$ iff $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_D)) = 1$. □
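The alphabet-extension step of this reduction can be illustrated with a small sketch. It is only an illustration under assumptions that are not in the text: automata are stored as Python dictionaries, letters of the original alphabet are strings, the fresh letters are the integers 1, 2, ..., and the names `encode_nearly_deterministic`, `run_dfa` and `decode` are hypothetical.

```python
def encode_nearly_deterministic(delta1, delta2):
    """Turn a nearly-deterministic NFA into a deterministic transition table.

    delta1: dict (state, letter) -> state, the functional part of the relation.
    delta2: list of (source, letter, target), the transitions used exactly once
            on every accepting run.
    The i-th transition of delta2 is relabelled with the fresh letter i.
    """
    dfa = dict(delta1)
    for i, (src, _letter, dst) in enumerate(delta2, start=1):
        dfa[(src, i)] = dst  # fresh letters are distinct, so this stays deterministic
    return dfa


def run_dfa(delta, start, accepting, word):
    """Run a (partial) DFA; a missing transition rejects."""
    q = start
    for letter in word:
        if (q, letter) not in delta:
            return False
        q = delta[(q, letter)]
    return q in accepting


def decode(word, delta2):
    """Substitute every fresh letter i by the letter of the i-th delta2 transition."""
    return [delta2[x - 1][1] if isinstance(x, int) else x for x in word]


# Toy automaton: state "p" before the single guess, state "q" after it.
delta1 = {("p", "a"): "p", ("q", "a"): "q"}
delta2 = [("p", "a", "q"), ("p", "b", "q")]
dfa = encode_nearly_deterministic(delta1, delta2)
```

In this toy instance the word `["a", 1, "a"]` is accepted by the encoded DFA, and `decode` maps it back to a word over the original alphabet with a single substitution, which mirrors the edit-distance-1 argument of the proof.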

3.3 Parameterized complexity 

Problems of high complexity can be practically viable if the complexity is caused by a parameter, which tends to be 
small in the applications. In this section we discuss the dependence of the complexity of TED based on its input 
values. 

Proposition 11. (1) There exist a threshold $t > 0$ and a DPDA $\mathcal{A}_P$ such that the variant of TED(DPDA, DFA), in which the threshold is fixed to $t$ and the DPDA is fixed to $\mathcal{A}_P$, is still ExpTime-complete. (2) The variant of TED(PDA, NFA), in which the threshold is given in unary and the NFA is fixed, is in PTime.

Proof. (1): The inclusion problem of DPDA in nearly-deterministic NFA is ExpTime-complete even if the DPDA is fixed (Lemma 8). Therefore, the reduction in Lemma 10 works for threshold 1 and a fixed DPDA.

(2): In the reduction from Lemma 4, the resulting PDA has size $|\mathcal{A}_P| \cdot (t + 2)^{|\mathcal{A}_N|}$, where $\mathcal{A}_P$ is a PDA, $\mathcal{A}_N$ is an NFA and $t$ is a threshold. If $\mathcal{A}_N$ is fixed and $t$ is given in unary, then $|\mathcal{A}_P| \cdot (t + 2)^{|\mathcal{A}_N|}$ is polynomial in the size of the input and we can decide its non-emptiness in polynomial time. □
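As a quick numeric illustration of part (2), the following sketch evaluates the size bound as it is read here, namely the PDA size times $(t+2)$ raised to the NFA size. The function name `product_pda_size` and the concrete numbers are hypothetical and serve only to show that, for a fixed NFA, the bound grows polynomially in a unary threshold.

```python
def product_pda_size(pda_size, nfa_size, t):
    """Size bound |A_P| * (t + 2) ** |A_N| for the PDA built in the reduction."""
    return pda_size * (t + 2) ** nfa_size


# With the NFA size fixed to 3, the bound is a cubic polynomial in t.
sizes = [product_pda_size(10, 3, t) for t in (1, 2, 4, 8)]
```

If instead the NFA is part of the input, the exponent grows with the input and the same expression becomes exponential.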

Conjecture 12 completes the study of the parameterized complexity of TED.

Conjecture 12. The variant of TED(PDA, NFA), in which the threshold is given in binary and the NFA is fixed, is in PTime.

4 Finite edit distance from pushdown to regular languages 

In this section we study the complexity of the FED problem from pushdown automata to finite automata. 

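Theorem 13. (1) For $C_1 \in \{\text{DPDA}, \text{PDA}\}$ and $C_2 \in \{\text{DFA}, \text{NFA}\}$ we have the following dichotomy: for all $\mathcal{A}_1 \in C_1$, $\mathcal{A}_2 \in C_2$ either $ed(\mathcal{L}(\mathcal{A}_1), \mathcal{L}(\mathcal{A}_2))$ is exponentially bounded in $|\mathcal{A}_1| + |\mathcal{A}_2|$ or $ed(\mathcal{L}(\mathcal{A}_1), \mathcal{L}(\mathcal{A}_2))$ is infinite. Conversely, for every $n$ there exist a DPDA $\mathcal{A}_P$ and a DFA $\mathcal{A}_D$, both of size $O(n)$, such that $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_D))$ is finite and exponential in $n$ (i.e., the dichotomy is asymptotically tight). (2) For $C_1 \in \{\text{DPDA}, \text{PDA}\}$ the FED($C_1$, NFA) problem is ExpTime-complete. (3) For $C_1 \in \{\text{DPDA}, \text{PDA}\}$ the FED($C_1$, DFA) problem is coNP-complete. (4) Given a PDA $\mathcal{A}_P$ and an NFA $\mathcal{A}_N$, we can compute the edit distance $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_N))$ in time exponential in $|\mathcal{A}_P| + |\mathcal{A}_N|$.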


First, we show in Section 4.1 the dichotomy of (1), which together with Theorem^ implies the ExpTime upper bound for (2). Next, in Section 4.2 we show that the FED(PDA, DFA) problem is in coNP, which together with the results from [3] shows (3). Finally, in Section 4.3 we show that FED(DPDA, NFA) is ExpTime-hard. We also present the exponential lower bound for (1). Conditions (1), (2), and Theorem 3 imply (4) (by iteratively testing with increasing thresholds up to exponential bounds along with the decision procedure from Theorem 3).

4.1 Upper bound for NFA 

In this section we consider the problem of deciding whether the edit distance from a PDA to an NFA is finite. 

We first give an overview of the section. Let $\mathcal{A}_N$ be an NFA and $\mathcal{A}_P$ a PDA that has $T$ non-terminals. We show (in Lemma 18) that for any word $w \in \mathcal{L}(\mathcal{A}_P)$ one can break the word into chunks $w = s_1 u_1 \ldots s_k u_k s_{k+1}$, such that $\sum_{i=1}^{k+1} |s_i| \le 2^T$ and for any $\ell$ the word $w_\ell$ defined as $w_\ell = s_1 (u_1)^\ell \ldots s_k (u_k)^\ell s_{k+1}$ belongs to $\mathcal{L}(\mathcal{A}_P)$ (this is in some sense the opposite of the pumping lemma, since the part that cannot be pumped is small). We then show (this follows from Lemma 19) that if there is a word $w \in \mathcal{L}(\mathcal{A}_P)$ such that $ed(w, \mathcal{L}(\mathcal{A}_N)) > 2^T$, then for every word $w_\ell$ defined as above we have $ed(w_{\ell+1}, \mathcal{L}(\mathcal{A}_N)) > ed(w_\ell, \mathcal{L}(\mathcal{A}_N))$ for all $\ell > 0$, showing that the edit distance $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_N))$ is unbounded. On the other hand, clearly, if $ed(w, \mathcal{L}(\mathcal{A}_N)) \le 2^T$ for all $w \in \mathcal{L}(\mathcal{A}_P)$, then the edit distance $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_N)) \le 2^T$ by definition.

We start with a reduction of the problem. Given a language $\mathcal{L}$, we define $\mathrm{prefix}(\mathcal{L}) = \{u : u \text{ is a prefix of some word from } \mathcal{L}\}$. We call an automaton $\mathcal{A}$ a safety automaton if every state of $\mathcal{A}$ is accepting. Note that automata are not necessarily total, i.e., some states might not have an outgoing transition for some input symbols, and thus a safety automaton does not necessarily accept all words. Note that for every NFA $\mathcal{A}_N$, the language $\mathrm{prefix}(\mathcal{L}(\mathcal{A}_N))$ is the language of a safety NFA. We show that FED(PDA, NFA) reduces to FED from PDA to safety NFA.
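One standard way to obtain a safety NFA for the prefix language is to keep only the states from which an accepting state is reachable and to make all of them accepting. The sketch below assumes a dictionary-based NFA encoding; the names `safety_nfa` and `accepts` are hypothetical and the text does not prescribe this particular construction.

```python
def safety_nfa(delta, initial, accepting):
    """Return (delta', initial', states') of a safety NFA for prefix(L(A)).

    delta: dict (state, letter) -> set of successor states.
    Only states that can still reach an accepting state are kept,
    and every kept state is accepting.
    """
    preds = {}
    for (src, _letter), dsts in delta.items():
        for dst in dsts:
            preds.setdefault(dst, set()).add(src)
    live, stack = set(accepting), list(accepting)
    while stack:  # backward reachability from the accepting states
        q = stack.pop()
        for p in preds.get(q, ()):
            if p not in live:
                live.add(p)
                stack.append(p)
    new_delta = {}
    for (src, letter), dsts in delta.items():
        kept = dsts & live
        if src in live and kept:
            new_delta[(src, letter)] = kept
    return new_delta, set(initial) & live, live


def accepts(delta, initial, accepting, word):
    """Standard NFA membership test by tracking the set of current states."""
    current = set(initial)
    for letter in word:
        current = set().union(*(delta.get((q, letter), set()) for q in current))
    return bool(current & accepting)


# NFA for the single word "ab", with a dead branch on "b" from the start state.
delta = {(0, "a"): {1}, (1, "b"): {2}, (0, "b"): {3}}
s_delta, s_init, s_states = safety_nfa(delta, {0}, {2})
```

For this example the resulting safety NFA accepts exactly the prefixes of "ab" and rejects words that leave the live part of the automaton.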

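Lemma 14. Let $\mathcal{A}_P$ be a PDA and $\mathcal{A}_N$ an NFA. The following inequalities hold:

$$ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_N)) \ge ed(\mathcal{L}(\mathcal{A}_P), \mathrm{prefix}(\mathcal{L}(\mathcal{A}_N))) \ge ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_N)) - |\mathcal{A}_N|$$

Proof. Since $\mathcal{L}(\mathcal{A}_N) \subseteq \mathrm{prefix}(\mathcal{L}(\mathcal{A}_N))$, we have

$$ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_N)) \ge ed(\mathcal{L}(\mathcal{A}_P), \mathrm{prefix}(\mathcal{L}(\mathcal{A}_N)))$$

as the latter is the minimum over a larger set by definition.

Hence, we only need to show the other inequality. First observe that for every $w \in \mathrm{prefix}(\mathcal{L}(\mathcal{A}_N))$, upon reading $w$, the automaton $\mathcal{A}_N$ can reach a state from which an accepting state is reachable and thus, an accepting state can be reached in at most $|\mathcal{A}_N|$ steps. Therefore, for every $w \in \mathrm{prefix}(\mathcal{L}(\mathcal{A}_N))$ there exists $w'$ of length bounded by $|\mathcal{A}_N|$ such that $ww' \in \mathcal{L}(\mathcal{A}_N)$. It follows that $ed(\mathcal{L}(\mathcal{A}_P), \mathrm{prefix}(\mathcal{L}(\mathcal{A}_N))) \ge ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_N)) - |\mathcal{A}_N|$. □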

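Remark 15. Consider an NFA $\mathcal{A}_N$ recognizing a language such that $\mathrm{prefix}(\mathcal{L}(\mathcal{A}_N)) = \Sigma^*$. For every PDA $\mathcal{A}_P$, the edit distance $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_N))$ is bounded by $|\mathcal{A}_N|$.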

In the remainder of this section we work with context-free grammars (CFGs) instead of PDAs. There are 
polynomial-time transformations between CFGs and PDAs that preserve the generated language; switching from 
PDAs to CFGs is made only to simplify the proofs. The following definition and lemma can be seen as a reverse 
version of the pumping lemma for context free grammars (in that we ensure that the part which can not be pumped is 
small). 

As an abuse of notation we will think of a sequence of words as both the concatenation of the words and the sequence. We define $[1, k] = \{1, \ldots, k\}$.

Left and right language. For a CFG $G$ and a non-terminal $A$, we define the languages

$$\mathcal{L}(G, A, L) = \{w \in \Sigma^* \mid \exists w' \in \Sigma^* \, (A \to^* wAw')\} \quad \text{and} \quad \mathcal{L}(G, A, R) = \{w \in \Sigma^* \mid \exists w' \in \Sigma^* \, (A \to^* w'Aw)\}.$$

Also, the set of directions is $\{L, R\}$, and $D$ ranges over it. We next argue that we can construct a CFG for $\mathcal{L}(G, A, D)$.

Lemma 16. Given a CFG $G$, a non-terminal $A$ and a direction $D$, we can construct in polynomial time a CFG $G'$ for which $\mathcal{L}(G') = \mathcal{L}(G, A, D)$.

Proof. We describe the construction of a CFG for $\mathcal{L}(G, A, L)$; the construction for $\mathcal{L}(G, A, R)$ is similar.

To simplify, we consider $G$ to be in CNF. We construct $G'$ as follows: The CFG $G'$ consists of two versions of each non-terminal in $G$, one with a star and one without, i.e., for each non-terminal $X$ of $G$, we have the non-terminals $X$ and $X^*$ in $G'$. The idea is that $X^*$ derives prefixes of words derivable from $X$ in $G$ that end just before an $A$. The productions are then as follows:

• Non-starred. Each production of $G$ is also in $G'$, which defines the productions for the non-starred non-terminals.

• Starred. For each production $X \to BC$ in $G$, there are productions $X^* \to BC^*$ and $X^* \to B^*$ in $G'$.

• Additional for $A^*$. The non-terminal $A^*$ has the production $A^* \to \epsilon$ in $G'$ (no other starred non-terminal can produce any terminal).

The start symbol of $G'$ is $A^*$. We next argue that $\mathcal{L}(G') = \mathcal{L}(G, A, L)$.

$\mathcal{L}(G') \subseteq \mathcal{L}(G, A, L)$. It is easy to see from the productions of $G'$ that the only way to remove a starred non-terminal is to eventually replace an $A^*$ by $\epsilon$. By construction this is the last non-terminal in some prefix $w$ of a word in $G$ with start state $A$ and thus $w \in \mathcal{L}(G, A, L)$.

$\mathcal{L}(G, A, L) \subseteq \mathcal{L}(G')$. Given a word $w$ in $\mathcal{L}(G, A, L)$, by definition there is an implied production rule $A \to^* wAw'$ (in $G$) for some $w'$. Given a derivation tree $\mathcal{D}$ for the implied production rule $A \to^* wAw'$, it is easy to construct a derivation tree $\mathcal{D}'$ for $w$ in $G'$, indicating that $w$ is in $\mathcal{L}(G')$. The two trees $\mathcal{D}$ and $\mathcal{D}'$ are identical except as follows: For a node $v$ in $\mathcal{D}'$ let $\mathcal{D}(v)$ be the corresponding node in $\mathcal{D}$. Let $\ell$ be the leaf in $\mathcal{D}'$ such that $\mathcal{D}(\ell)$ is the leaf with label $A$ in $\mathcal{D}$. The production rule of $\ell$ is $A^* \to \epsilon$. Then, consider the path $\pi$ from $\ell$ to the root of $\mathcal{D}'$. For each internal node $v$ in $\pi$ where $\mathcal{D}(v)$ has production rule $X \to BC$, we have the following:

• $\pi$ comes from the left child. If $\pi$ goes through the left child, the production rule of $v$ is $X^* \to B^*$ and the sub-tree under the right child of $\mathcal{D}(v)$ is cut out of $\mathcal{D}'$ (including that $v$ has no right child in this case).

• $\pi$ comes from the right child. If $\pi$ goes through the right child, the production rule of $v$ is $X^* \to BC^*$.

Then, tree $\mathcal{D}'$ spells the word $w$ and is a derivation tree in $G'$. Thus $w \in \mathcal{L}(G')$ and the lemma follows. □
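The starred-grammar construction from this proof can be sketched as follows. The encoding is an assumption made for illustration: a CNF grammar is a dictionary from non-terminals to lists of right-hand sides (tuples), terminals are single characters that are not dictionary keys, and the starred copy of `X` is named `X*`. The helper `words_up_to` is a hypothetical brute-force enumerator used only to inspect small examples; it relies on $G$ having no $\epsilon$-productions.

```python
def left_language_grammar(G, A):
    """Build G' with L(G') = L(G, A, L) for a CNF grammar G (no epsilon rules)."""
    Gp = {X: list(rhss) for X, rhss in G.items()}  # non-starred productions
    for X in G:
        Gp[X + "*"] = []
    for X, rhss in G.items():
        for rhs in rhss:
            if len(rhs) == 2:  # X -> B C yields X* -> B C* and X* -> B*
                B, C = rhs
                Gp[X + "*"].append((B, C + "*"))
                Gp[X + "*"].append((B + "*",))
    Gp[A + "*"].append(())  # A* -> epsilon
    return Gp, A + "*"


def words_up_to(Gp, start, max_len):
    """Enumerate all derivable words of length <= max_len (leftmost derivations)."""
    found, seen, todo = set(), {(start,)}, [(start,)]
    while todo:
        form = todo.pop()
        idx = next((i for i, x in enumerate(form) if x in Gp), None)
        if idx is None:  # only terminals left
            found.add("".join(form))
            continue
        for rhs in Gp[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            # every terminal and every non-starred non-terminal yields >= 1 letter
            weight = sum(1 for x in new if not x.endswith("*"))
            if weight <= max_len and new not in seen:
                seen.add(new)
                todo.append(new)
    return found


# In this grammar A ->* b A d, so L(G, A, L) consists of the words b^n.
G = {"A": [("B", "C")], "B": [("b",)], "C": [("A", "D"), ("c",)], "D": [("d",)]}
Gp, start = left_language_grammar(G, "A")
left_words = words_up_to(Gp, start, 3)
```

The enumeration is exponential in general and is only meant for checking tiny grammars; the polynomial-time claim of the lemma concerns the construction of $G'$ itself.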

Realizable. Given a CFG $G$ in Chomsky normal form, we define the realizable CFG $\widehat{G}$ (for clarity we do not define it in CNF) that

• for each production of the form $P : A \to a$ in $G$ has the production $\widehat{P} : A \to \epsilon$,

• for each production of the form $P : A \to BC$ in $G$ has the production $\widehat{P} : A \to a_A^L \, B \, C \, a_A^R$,

and no other productions (the language is then over the terminals $\{a_A^D \mid D \in \{L, R\} \wedge A \text{ is a non-terminal}\}$). A sequence is realizable if it is a sub-sequence of a word in $\mathcal{L}(\widehat{G})$, i.e., it results from deletion of letters from some word of $\mathcal{L}(\widehat{G})$.

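Lemma 17. Let $a_{A_1}^{D_1} \ldots a_{A_k}^{D_k}$ be a realizable sequence in $\widehat{G}$. Then for every sequence of words $w_1 \in \mathcal{L}(G, A_1, D_1), \ldots, w_k \in \mathcal{L}(G, A_k, D_k)$ there exist words $s_1, \ldots, s_k, s_{k+1}$ such that $s_1 w_1 \ldots s_k w_k s_{k+1}$ belongs to $\mathcal{L}(G)$.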

Proof. Let $a = a_{A_1}^{D_1} \ldots a_{A_k}^{D_k}$. We consider two cases: either $a \in \mathcal{L}(\widehat{G})$ or not.

The case where $a \in \mathcal{L}(\widehat{G})$. Consider a derivation tree $\widehat{\mathcal{D}}$ for $a$. We translate it into a derivation tree in $G$ for $s_1 w_1 \ldots s_k w_k s_{k+1}$, by replacing each production (which are in $\widehat{G}$) of the nodes in $\widehat{\mathcal{D}}$ with (generalized) productions in $G$.

Each leaf node $v$ corresponds to a production $\widehat{P} : A \to \epsilon$. By definition there exists a production $P : A \to a$ in $G$ and we then simply replace $\widehat{P}$ in $\widehat{G}$ with $P$ in $G$.

Each non-leaf node $v$, with children $b$ and $c$ respectively, corresponds to the use of a production $\widehat{P} : A \to a_A^L \, B \, C \, a_A^R$, where $a_A^L$ is the $i$-th letter and $a_A^R$ the $j$-th letter of $a$ for some $i, j$. By definition of $\mathcal{L}(G, A, D)$ we have that there is a production $P' : A \to^* w_i w_j' B C w_j w_i'$ in $G$ for some $w_i', w_j'$. In this case we replace $\widehat{P}$ in $\widehat{G}$ with $P'$ in $G$. (The words $s_{j+1}$ are concatenations of words $w_j'$ and letters derived by productions $P : A \to a$ corresponding to $\widehat{P} : A \to \epsilon$.)

The case where $a \notin \mathcal{L}(\widehat{G})$. Find a word $a' \in \mathcal{L}(\widehat{G})$ such that $a$ is a sub-sequence of $a'$ (letting $p$ be the sequence of positions defining $a$ from $a'$) and do as above with $a'$ and the sequence of words $s'$, which is an extension of the sequence $s$ to length $|a'|$ obtained by inserting $\epsilon$ at the remaining positions (i.e., the extension is such that $s$ is the sub-sequence of $s'$ defined by $p$). □

Compact G-decomposition. Given a CFG $G$ with a set of non-terminals of size $T$ and a word $w \in \mathcal{L}(G)$, we define a compact G-decomposition of $w$ as $w = s_1 u_1 \ldots s_k u_k s_{k+1}$ such that

1. for each $u_i$, there is an associated terminal $a_{A_i}^{D_i}$, such that the sequence $a_{A_1}^{D_1} \ldots a_{A_k}^{D_k}$ is realizable and $u_i \in \mathcal{L}(G, A_i, D_i)$.

2. for all $\ell \in \mathbb{N}$, the word $w_\ell := s_1 (u_1)^\ell s_2 \ldots s_k (u_k)^\ell s_{k+1}$ is in $\mathcal{L}(G)$.

3. $|w_0| = \sum_{i=1}^{k+1} |s_i| \le 2^T$ and $k \le 2^{T+1} - 2$.

Lemma 18. For every CFG $G$ in CNF, every word $w \in \mathcal{L}(G)$ admits a compact G-decomposition.

Intuition. The proof follows by repeated applications of the principle behind the pumping lemma, until the part which is not pumped is small.

Proof. Fix some $\ell$ and consider some word $w$ in $\mathcal{L}(G)$ and some derivation tree $\mathcal{D}$ for $w$. We will greedily construct a compact G-decomposition, using that we do not give bounds on $|u_i|$.

Greedy traversal and the first two properties. The idea is to consider nodes of $\mathcal{D}$ in a depth-first pre-order traversal (ensuring that when we consider some node we have already considered its ancestors). When we consider some node $v$, we continue with the traversal, unless there exists a descendant $u$ of $v$ such that $\mathcal{D}[v] = \mathcal{D}[u]$. If there exists such a descendant, let $u'$ be the bottom-most descendant (pick an arbitrary one if there is more than one such bottom-most descendant) such that $A := \mathcal{D}[v] = \mathcal{D}[u']$. We say that $(v, u')$ forms a pump pair of $w$. Consider the subwords $a^v, a^{u'}$ of $w$ derived by the subtrees of $\mathcal{D}$ with roots at $v$ and $u'$ respectively. We can then write $a^v$ as $s \, a^{u'} s'$ (and hence $A \to^* sAs'$), for some $s$ and $s'$ in the obvious way, and $s$ and $s'$ will correspond to $u_i$ and $u_j$ respectively for some $i < j$ ($i$ and $j$ are defined by the traversal: we have already assigned $i - 1$ $u$'s when we first visit $v$, and we then assign $s$ as $u_i$; when we return to the parent of $v$, we have assigned $j - 1$ $u$'s and assign $s'$ to be $u_j$).

Furthermore, $u_i$ is associated with $a_A^L$ and $u_j$ is associated with $a_A^R$. Observe that $A \to^* u_i A u_j$ implies that $u_i \in \mathcal{L}(G, A, L)$ and $u_j \in \mathcal{L}(G, A, R)$ and we therefore have ensured the first property of compact G-decomposition. This also shows that we can replace $u_i$ with $(u_i)^\ell$ and $u_j$ with $(u_j)^\ell$ (because, clearly, $A \to^* (u_i)^\ell A (u_j)^\ell$) and the new word is in $\mathcal{L}(G)$. Hence, $w_\ell$ is in $\mathcal{L}(G)$, showing the second property of compact G-decomposition. This furthermore defines a derivation tree $\mathcal{D}_0$ for $w_0$ (which has 0 occurrences of the words $u_1, u_2, \ldots$), which is the same as $\mathcal{D}$, except that for each pump pair $(v, u')$, the node $v$ is replaced with the sub-tree of $\mathcal{D}$ with root $u'$. So as to not split $u_i$ or $u_j$ up, we continue the traversal on $u'$, which, when it is finished, continues the traversal in the parent of $v$, having finished with $v$. Notice that this ensures that each node is in at most one pump pair.

The third property. Consider the word $w_0$, which has 0 occurrences of the words $u_1, u_2, \ldots$. Observe that in the derivation tree $\mathcal{D}_0$ for $w_0$, there is at most one occurrence of each non-terminal on each path to the root, since we visited all nodes of $\mathcal{D}_0$ in our defining traversal and were greedy. Hence, the height is at most $T$ and thus, since the tree is binary, it has at most $2^{T-1}$ many leaves, which is then a bound on $|w_0| = \sum_{i=1}^{k+1} |s_i|$. Notice that each node of $\mathcal{D}_0$, being a subset of $\mathcal{D}$, is in at most one pump pair of $w$. On the other hand, for each pump pair $(v, u')$ of $w$, we have that $u'$ is a node of $\mathcal{D}_0$ by construction. Hence, $w$ has at most $2^T - 1$ many pump pairs. Since each pump pair gives rise to at most 2 words $u_i, u_j$, we have $k \le 2^{T+1} - 2$. □

Sets closed under reachability. Fix an NFA. We say that a set $Q'$ of states of the NFA is closed under reachability if for all $q \in Q'$ and $a \in \Sigma$ we have $\delta(q, a) \subseteq Q'$. Clearly, the set of all states is closed under reachability.

Reachability sets. Fix an NFA. Given a state $q$ in the NFA and a word $w$, let $Q_q^w$ be the set of states reachable upon reading $w$, starting in $q$. The set of states $R(w, q)$ is then the set of states reachable from $Q_q^w$ upon reading any word. For a set $Q'$ and word $w$, the set $R(w, Q')$ is $\bigcup_{q \in Q'} R(w, q)$.

Note the following: For all $Q'$ and $w$ the set $R(w, Q')$ is closed under reachability. If a set $Q'$ is closed under reachability then $R(w, Q') \subseteq Q'$ for all $w$.

We have the following property of reachability sets: Fix a word $u$, a number $\ell$, an NFA and a set of states $Q'$ of the NFA, where $Q'$ is closed under reachability. Let $u'$ be a word with $\ell$ occurrences of $u$ (e.g. $u^\ell$). Consider any word $w$ with edit distance strictly less than $\ell$ from $u'$. Any run on $w$, starting in some state of $Q'$, reaches a state of $R(u, Q')$. This is because $u$ must be a sub-word of $w$.

Lemma 19. Let $G$ be a CFG in CNF with a set of non-terminals of size $T$ and let $\mathcal{A}_N$ be a safety NFA with a set of states $Q$, where $n = |Q|$. The following conditions are equivalent:

(i) the edit distance $ed(\mathcal{L}(G), \mathcal{L}(\mathcal{A}_N))$ is infinite,

(ii) the edit distance $ed(\mathcal{L}(G), \mathcal{L}(\mathcal{A}_N))$ exceeds $B := (2^{T+1} - 2) \cdot n + 2^T$,

(iii) there exists a word $w \in \mathcal{L}(G)$, with compact G-decomposition $w = (s_i u_i)_{i=1}^{k} s_{k+1}$, such that

$$R(u_k, R(u_{k-1}, R(u_{k-2}, \ldots R(u_1, Q) \ldots))) = \emptyset,$$

and

(iv) there exist words $u_1, \ldots, u_k$ such that $R(u_k, R(u_{k-1}, R(u_{k-2}, \ldots R(u_1, Q) \ldots))) = \emptyset$ and for every $\ell > 0$ there exist words $s_1, \ldots, s_{k+1}$ such that the word $w_\ell = (s_i u_i^\ell)_{i=1}^{k} s_{k+1}$ belongs to $\mathcal{L}(G)$.

We use condition (iv) from Lemma 19 later in Section 4.2. Before we proceed with the proof, we argue by example that the nested applications of the $R$ function in Lemma 19 are necessary.

The necessity of the recursive applications of the $R$ operator. Consider for instance the alternate requirement that at least one of $R(u_i, Q)$ is empty, for some $i$. This alternate requirement would not capture that the pushdown language $\{a^n \# b^n \mid n \in \mathbb{N}\}$ has infinite edit distance to the regular language $a^* + b^*$: for any word in the pushdown language $w = a^n \# b^n$, for some fixed $n$, a compact G-decomposition of $w$ is $u_1 = a^n$, $s_2 = \#$ and $u_2 = b^n$ (and the remaining words are empty). But clearly $R(u_1, Q)$ and $R(u_2, Q)$ are not empty since both strings are in the regular language. On the other hand, $R(u_2, R(u_1, Q))$ is empty.
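The example can be replayed mechanically. The following sketch computes reachability sets for a dictionary-encoded NFA and evaluates them on a three-state safety automaton for $a^* + b^*$; the encoding, the state names and the function names `closure` and `R` are assumptions made for illustration.

```python
def closure(delta, states):
    """All states reachable from `states` by reading any word (including the empty word)."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for (src, _letter), dsts in delta.items():
            if src == q:
                for d in dsts:
                    if d not in seen:
                        seen.add(d)
                        stack.append(d)
    return seen


def R(delta, word, states):
    """Reachability set R(word, states): read the word, then close under reachability."""
    current = set(states)
    for letter in word:
        current = set().union(*(delta.get((q, letter), set()) for q in current))
    return closure(delta, current)


# Safety automaton for a* + b*: q0 is initial, qa loops on a, qb loops on b.
delta = {("q0", "a"): {"qa"}, ("qa", "a"): {"qa"},
         ("q0", "b"): {"qb"}, ("qb", "b"): {"qb"}}
Q = {"q0", "qa", "qb"}

single_a = R(delta, "aa", Q)                   # non-empty
single_b = R(delta, "bb", Q)                   # non-empty
nested = R(delta, "bb", R(delta, "aa", Q))     # empty
```

Each single application leaves a non-empty set, while the nested application is empty, which is the behaviour the paragraph describes.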

Proof. The implication (i) $\Rightarrow$ (ii) is trivial.

We show the implication (ii) $\Rightarrow$ (iii) as follows: Consider a word $w \in \mathcal{L}(G)$ with $ed(w, \mathcal{L}(\mathcal{A}_N)) > B$ and its compact G-decomposition $w = (s_i u_i)_{i=1}^{k} s_{k+1}$ (which exists due to Lemma 18). We claim that $R(u_k, R(u_{k-1}, R(u_{k-2}, \ldots R(u_1, Q) \ldots))) = \emptyset$. The argument is by contradiction. Assume that $R(u_k, R(u_{k-1}, R(u_{k-2}, \ldots R(u_1, Q) \ldots))) \ne \emptyset$; we will construct a run spelling a word $w'$ in $\mathcal{L}(\mathcal{A}_N)$, which has edit distance at most $B$ to $w$. The run is described iteratively in $i$; we start with $i = 1$. First, spell out a word $s_i'$, so that $\mathcal{A}_N$ reaches some state $q_i$ such that there exists a run on $u_i$. The length of $s_i'$ is at most $n$. Afterwards follow the run on $u_i$ and go to the next iteration. This run spells the word $w' := (s_i' u_i)_{i=1}^{k}$. All the choices of the $q_i$'s can be made since $R(u_k, R(u_{k-1}, R(u_{k-2}, \ldots R(u_1, Q) \ldots))) \ne \emptyset$. Also, since $\mathcal{A}_N$ is a safety automaton, this run is accepting. To edit $w'$ into $w$, change each $s_i'$ into $s_i$, and insert $s_{k+1}$ at the end. In the worst case, each $s_i$ is empty except for $i = k + 1$, and in that case it requires $k \cdot n + |w_0| \le B$ edits for deleting each $s_i'$ and inserting $s_{k+1}$ at the end (in any other case, we would be able to substitute some letters when we change some $s_i'$ into $s_i$, which would make the edit distance smaller). This is a contradiction.

The implication (iii) $\Rightarrow$ (iv) is trivial.

For the implication (iv) $\Rightarrow$ (i) we will argue that for all $\ell$, the word $w_\ell \in \mathcal{L}(G)$ requires at least $\ell$ edits. Consider $w_\ell = (s_i u_i^\ell)_{i=1}^{k} s_{k+1}$ for some $\ell$. Any run on $s_1 u_1^\ell$ (a prefix of $w_\ell$) has entered $R(u_1, Q)$ or made at least $\ell$ edits by the property of reachability sets. Similarly, for any $j$, any run on $(s_i u_i^\ell)_{i=1}^{j}$ has either entered $R(u_j, R(u_{j-1}, R(u_{j-2}, \ldots R(u_1, Q) \ldots)))$ or there have been at least $\ell$ edits. Since $R(u_k, R(u_{k-1}, R(u_{k-2}, \ldots R(u_1, Q) \ldots))) = \emptyset$, no run can enter that set and thus there have been at least $\ell$ edits on $w_\ell$. The implication and thus the lemma follows. □

As a direct consequence of Lemma 19 we have the following.

Theorem 20. (1) For a PDA $\mathcal{A}_P$ and an NFA $\mathcal{A}_N$, the edit distance $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_N))$ is either exponentially bounded in $|\mathcal{A}_P|$ or it is infinite. (2) For $C_1 \in \{\text{DPDA}, \text{PDA}\}$ we have that FED($C_1$, NFA) is in ExpTime.

Proof. (1) The equivalence of (i) and (ii) gives a bound on the maximum finite edit distance.

(2) The argument follows from Lemma 6 and (1), i.e., we can check with Lemma 6 TED for $k$ exceeding the bound from (1). □

4.2 Upper bound for DFA 

We show that for $C_1 \in \{\text{DPDA}, \text{PDA}\}$ the problem FED($C_1$, DFA) is coNP-complete.

coNP-hardness and attempting to apply known techniques for the upper bound. The lower bound follows directly from the fact that FED(DFA, DFA) is coNP-hard [3]. We thus focus on the upper bound. Note that the upper bound was simple for FED(DFA, DFA), since the edit distance in that case is either polynomial or infinite and there is a polynomial-length witness in case it is infinite. Hence, one just guesses the polynomial-sized witness $w$ and runs a polynomial-time algorithm for $ed(w, \text{DFA})$, and the result follows. Doing the same for FED(PDA, DFA) would give a NExpTime upper bound, since the word we need to guess might be of exponential length (thus the above ExpTime upper bound for FED(PDA, NFA) is better). To give our algorithm, we will first define extended reachability sets and give a key proposition.
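The polynomial-time computation of the edit distance from a single word to a regular language is not spelled out in the text. One standard way to do it, sketched here under the assumption of a dictionary-encoded automaton with string-named states, is a shortest-path search over pairs (position in the word, automaton state), where matching a letter is free and each insertion, deletion or substitution costs 1. The function name `edit_distance_to_nfa` is hypothetical.

```python
import heapq


def edit_distance_to_nfa(word, delta, initial, accepting):
    """Minimal number of edits turning `word` into some word accepted by the automaton.

    delta: dict (state, letter) -> set of successor states (a DFA is a special case).
    Returns None if the automaton accepts no word at all.
    """
    n = len(word)
    settled = set()
    heap = [(0, 0, q) for q in initial]
    heapq.heapify(heap)
    while heap:
        cost, i, q = heapq.heappop(heap)
        if (i, q) in settled:
            continue
        settled.add((i, q))
        if i == n and q in accepting:
            return cost  # Dijkstra: the first settled goal node is optimal
        if i < n:
            heapq.heappush(heap, (cost + 1, i + 1, q))  # delete word[i]
        for (src, letter), dsts in delta.items():
            if src != q:
                continue
            for d in dsts:
                heapq.heappush(heap, (cost + 1, i, d))  # insert `letter`
                if i < n:
                    step = 0 if letter == word[i] else 1  # match or substitute
                    heapq.heappush(heap, (cost + step, i + 1, d))
    return None


# Automaton for a* + b* with every state accepting.
delta = {("q0", "a"): {"qa"}, ("qa", "a"): {"qa"},
         ("q0", "b"): {"qb"}, ("qb", "b"): {"qb"}}
accepting = {"q0", "qa", "qb"}
```

The search graph has at most $(|w| + 1) \cdot |Q|$ nodes, so the running time is polynomial in the word length and the automaton size, which is what the guess-and-check argument for FED(DFA, DFA) needs.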

Closed under concatenation and extended reachability sets. A language $L$ is said to be closed under concatenation if for all $w_1, w_2 \in L$ we have $w_1 w_2 \in L$. Note that $\mathcal{L}(G, A, D)$, for any non-terminal $A$ and direction $D$, is always closed under concatenation.

We extend reachability sets as follows: Let $L$ be a context-free language closed under concatenation, let $\mathcal{A}_D$ be a DFA and let $Q'$ be a subset of the states of $\mathcal{A}_D$. We define $R(L, Q')$ as the intersection $\bigcap_{w \in L} R(w, Q')$. Observe that for every $L$ there exists a finite subset $W \subseteq L$ such that $\bigcap_{w \in W} R(w, Q') = R(L, Q')$.

Remark 21. If $Q'$ is closed under reachability, then for any set of words $W = \{w_1, w_2, \ldots, w_k\} \subseteq L$ such that $\bigcap_{w \in W} R(w, Q') = R(L, Q')$, we have that $w' = w_1 w_2 \ldots w_k \in L$ and $R(w', Q') = R(L, Q')$. The latter comes from the fact that for any word $w''$ and set $Q''$ closed under reachability, we have that $R(s_1 w'' s_2, Q'') \subseteq R(w'', Q'')$ for all $s_1$ and $s_2$.

Also, observe that we have the following facts about $R$, from the definition of $R$:

1. For any $Q'' \subseteq Q'$ and word $w$ we have that $R(w, Q'') \subseteq R(w, Q')$.

2. For any language $L$, any $Q'$ and word $w \in L$, we have that $R(L, Q') \subseteq R(w, Q')$.

The following proposition is key to our coNP algorithm.

Proposition 22. For any $k$, any sequence of languages $L_1, \ldots, L_k$ and any word $w_i \in L_i$ for each $i$, we have

$$R(L_k, \ldots, R(L_1, Q) \ldots) \subseteq R(w_k, \ldots, R(w_1, Q) \ldots).$$

Also, if each $L_i$ is closed under concatenation, then there exist words $w_i' \in L_i$ for each $i$, such that

$$R(L_k, \ldots, R(L_1, Q) \ldots) = R(w_k', \ldots, R(w_1', Q) \ldots).$$

Proof. The proposition follows from Remark 21 and simple induction. □

coNP upper bound algorithm. Our coNP algorithm InfEdsSeq deciding whether the edit distance is finite works as follows:

1. Guess a sequence $s = a_{A_1}^{D_1} \ldots a_{A_k}^{D_k}$, for some $k$.

2. Return “no” if $s$ is such that (1) $s$ is realizable; and (2)

$$R(\mathcal{L}(G, A_k, D_k), R(\mathcal{L}(G, A_{k-1}, D_{k-1}), \ldots R(\mathcal{L}(G, A_1, D_1), Q) \ldots)) = \emptyset.$$

3. Otherwise return “yes”.

Requirements for InfEdsSeq to be in coNP. For InfEdsSeq to be in coNP, we need to give the following:

1. A polynomial bound on $k$ (so that $s$ is a polynomial-sized witness). The bound will be given in Lemma 24.

2. A polynomial-time algorithm to decide whether a sequence $s = a_{A_1}^{D_1} a_{A_2}^{D_2} \ldots a_{A_k}^{D_k}$ is realizable. The algorithm will be given in Lemma 25.

3. A polynomial-time algorithm for computing $R(\mathcal{L}(G, A, D), Q')$ for any CFG $G$, any non-terminal $A$, any direction $D$ and any set $Q'$ closed under reachability. This will allow us to decide, given a realizable sequence $s = a_{A_1}^{D_1} \ldots a_{A_k}^{D_k}$, whether

$$R(\mathcal{L}(G, A_k, D_k), R(\mathcal{L}(G, A_{k-1}, D_{k-1}), \ldots R(\mathcal{L}(G, A_1, D_1), Q) \ldots)) = \emptyset,$$

by evaluating the expression on the left-hand side inside-out. The algorithm for computing $R(\mathcal{L}(G, A, D), Q')$ will be given in Corollary 27.

We will first argue that the algorithm is correct.

Lemma 23. The algorithm InfEdsSeq is correct.

Proof. To argue that the algorithm is correct, we just need to argue that a sequence with properties (1) and (2) exists if and only if the edit distance is infinite.

Such a sequence implies infinite edit distance. According to Proposition 22, such a sequence indicates that there are words $w_i \in \mathcal{L}(G, A_i, D_i)$ for each $i$, such that

$$R(w_k, \ldots, R(w_1, Q) \ldots) = \emptyset.$$

For all $i$, since $\mathcal{L}(G, A_i, D_i)$ is closed under concatenation, we also have that $w_i^\ell \in \mathcal{L}(G, A_i, D_i)$ for all $\ell > 0$. Thus, by Lemma 17, there exist words $s_1, \ldots, s_k, s_{k+1}$ such that $s_1 w_1^\ell \ldots s_k w_k^\ell s_{k+1}$ belongs to $\mathcal{L}(G)$. Hence, item (iv) of Lemma 19 is satisfied and we get that the edit distance is infinite.

Infinite edit distance implies the existence of such a sequence. When the edit distance is infinite, then, according to item (iii) of Lemma 19, there exists a word $w \in \mathcal{L}(G)$, with compact G-decomposition $w = (s_i u_i)_{i=1}^{k} s_{k+1}$, such that $R(u_k, R(u_{k-1}, R(u_{k-2}, \ldots R(u_1, Q) \ldots))) = \emptyset$. By definition of compact G-decomposition, every $u_i$ from the decomposition is associated with a terminal $a_{A_i}^{D_i}$, such that the sequence $a_{A_1}^{D_1} \ldots a_{A_k}^{D_k}$ is realizable (satisfying property (1)) and $u_i \in \mathcal{L}(G, A_i, D_i)$ for each $i$. By Proposition 22, we then have that

$$R(\mathcal{L}(G, A_k, D_k), R(\mathcal{L}(G, A_{k-1}, D_{k-1}), \ldots R(\mathcal{L}(G, A_1, D_1), Q) \ldots)) = \emptyset$$

(satisfying property (2)). Thus such a sequence always exists and the lemma follows. □

Next, we will give the bounds and algorithms to show that InfEdsSeq is in coNP. First the bound on $k$.

Lemma 24. Let $G$ be a CFG and let $\mathcal{A}_D$ be a safety DFA with a set of states $Q$. The following conditions are equivalent:

(i) the edit distance $ed(\mathcal{L}(G), \mathcal{L}(\mathcal{A}_D))$ is infinite.

(ii) there exists a realizable sequence $a_{A_1}^{D_1} \ldots a_{A_m}^{D_m}$ with $m \le |Q|$ such that

$$R(\mathcal{L}(G, A_m, D_m), R(\mathcal{L}(G, A_{m-1}, D_{m-1}), \ldots, R(\mathcal{L}(G, A_1, D_1), Q) \ldots)) = \emptyset.$$

Proof. (i) implies (ii). Assume that $ed(\mathcal{L}(G), \mathcal{L}(\mathcal{A}_D))$ is infinite. By Lemma 19, there exists a word $w \in \mathcal{L}(G)$, with compact G-decomposition $w = (s_i u_i)_{i=1}^{k} s_{k+1}$, such that $R(u_k, R(u_{k-1}, \ldots, R(u_1, Q) \ldots)) = \emptyset$. Observe that $k$ can be exponential. We claim that we can pick from $u_1, \ldots, u_k$ a sub-sequence of polynomial length in $|Q|$ for which the reachable set of states is empty as well. Indeed, the sequence $s = R(u_1, Q), R(u_2, R(u_1, Q)), \ldots$ is weakly decreasing with respect to set inclusion (i.e., if a state is not in $s[i]$, then it cannot be in $s[j]$ for $j > i$, because $R$ is closed under reachability). We select from $1, \ldots, k$ the indices $i$ on which the sequence $R(u_1, Q), R(u_2, R(u_1, Q)), \ldots$ strictly decreases and denote the resulting sub-sequence by $\alpha$. Then,

$$R(u_{\alpha(m)}, R(u_{\alpha(m-1)}, \ldots, R(u_{\alpha(1)}, Q) \ldots)) = \emptyset.$$

There are at most $|Q|$ such indices, therefore $|\alpha| = m \le |Q|$. Using Proposition 22, since $u_{\alpha(i)} \in \mathcal{L}(G, A_{\alpha(i)}, D_{\alpha(i)})$ by compact G-decomposition, we get that:

$$R(\mathcal{L}(G, A_{\alpha(m)}, D_{\alpha(m)}), \ldots, R(\mathcal{L}(G, A_{\alpha(1)}, D_{\alpha(1)}), Q) \ldots) \subseteq R(u_{\alpha(m)}, \ldots, R(u_{\alpha(1)}, Q) \ldots) = \emptyset.$$

(ii) implies (i). Assume that condition (ii) holds. Then, the algorithm InfEdsSeq returns “no”, and its correctness (Lemma 23) implies (i). □
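The index selection used in this proof can be sketched for a concrete DFA. The code below assumes a dictionary-encoded partial DFA and a list of words standing in for $u_1, \ldots, u_k$; the function names `R` and `decreasing_indices` are hypothetical. It keeps exactly those indices at which the nested reachability set shrinks, so at most $|Q|$ indices can be selected when the starting set is closed under reachability.

```python
def R(delta, word, states):
    """Reachability set for a partial DFA given as dict (state, letter) -> state."""
    current = {q for q in states}
    for letter in word:
        current = {delta[(q, letter)] for q in current if (q, letter) in delta}
    stack = list(current)
    while stack:  # close under reachability
        q = stack.pop()
        for (src, _letter), dst in delta.items():
            if src == q and dst not in current:
                current.add(dst)
                stack.append(dst)
    return current


def decreasing_indices(delta, words, Q):
    """Indices (1-based) where R(u_i, R(u_{i-1}, ...)) strictly shrinks, plus the final set."""
    current, picked = set(Q), []
    for i, u in enumerate(words, start=1):
        nxt = R(delta, u, current)
        if nxt < current:  # strict subset
            picked.append(i)
        current = nxt
    return picked, current


# Safety DFA for a* + b* and a long sequence of words with repetitions.
delta = {("q0", "a"): "qa", ("qa", "a"): "qa", ("q0", "b"): "qb", ("qb", "b"): "qb"}
Q = {"q0", "qa", "qb"}
words = ["a", "a", "b", "b"]
picked, final = decreasing_indices(delta, words, Q)

# Re-evaluating the nested sets on the selected sub-sequence only.
sub = set(Q)
for i in picked:
    sub = R(delta, words[i - 1], sub)
```

In this example only the first and third word are kept, and the nested reachability set of the shortened sequence is still empty.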

Next we will describe the algorithm deciding whether a sequence is realizable.

Lemma 25. Let $G$ be a CFG. We can decide in polynomial time whether a given sequence $s = a_{A_1}^{D_1} \ldots a_{A_k}^{D_k}$ is realizable.

Proof. Consider the grammar $\widehat{G}$ associated with $G$. We convert $\widehat{G}$ to CNF and add productions $A \to \epsilon$ for every non-terminal. Let the resulting CFG be $G'$. Observe that $G'$ derives a word of terminals and non-terminals $u$ if and only if $\widehat{G}$ derives a word $u'$ such that $u$ is a subsequence of $u'$. Thus, $(A_1, D_1), \ldots, (A_k, D_k)$ is realizable if and only if $a_{A_1}^{D_1} \ldots a_{A_k}^{D_k}$ is derivable by $G'$. Since $G'$ has polynomial size in $G$, we can check whether a word is derivable in $G'$ in polynomial time. □
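A direct way to test realizability, in the spirit of this proof, is a CYK-style table over spans of the sequence. The sketch below does not build $G'$ explicitly; instead it uses the fact that in $G'$ every non-terminal may derive the empty word and every terminal $a_A^L$, $a_A^R$ may be skipped. The grammar encoding, the start symbol argument and the name `is_realizable` are assumptions, and $G$ is assumed to be in CNF without useless non-terminals.

```python
def is_realizable(G, start, seq):
    """Decide whether seq (a list of (non-terminal, direction) pairs) is realizable.

    G: dict non-terminal -> list of right-hand sides; (a,) for A -> a and (B, C) for A -> B C.
    A triple (X, i, j) in `can` means that X derives seq[i:j] in the grammar G'.
    """
    n = len(seq)
    binary = [(A, rhs[0], rhs[1]) for A, rhss in G.items() for rhs in rhss if len(rhs) == 2]
    can = {(X, i, i) for X in G for i in range(n + 1)}  # everything derives epsilon
    for length in range(1, n + 1):
        changed = True
        while changed:  # fixpoint: a child may cover the same span as its parent
            changed = False
            for i in range(n - length + 1):
                j = i + length
                for A, B, C in binary:
                    if (A, i, j) in can:
                        continue
                    # a_A^L either matches seq[i] or is skipped; similarly a_A^R at the end
                    los = [i] + ([i + 1] if seq[i] == (A, "L") else [])
                    his = [j] + ([j - 1] if seq[j - 1] == (A, "R") else [])
                    if any((B, lo, m) in can and (C, m, hi) in can
                           for lo in los for hi in his if lo <= hi
                           for m in range(lo, hi + 1)):
                        can.add((A, i, j))
                        changed = True
    return (start, 0, n) in can


G = {"A": [("B", "C")], "B": [("b",)], "C": [("A", "D"), ("c",)], "D": [("d",)]}
```

For this grammar every word of $\mathcal{L}(\widehat{G})$ lists all left markers before all right markers, so a sequence that places $a_A^R$ before $a_A^L$ is rejected, and a marker for a non-terminal without a binary production can never occur.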

Finally, we present the algorithm that computes $R(\mathcal{L}(G, A, D), Q')$. The result will follow as a corollary of the following lemma.

Lemma 26. Given a CFG $G$, such that $\mathcal{L}(G)$ is closed under concatenation, a DFA $\mathcal{A}_D$ with a set of states $Q$ and a set of states $Q' \subseteq Q$ closed under reachability, the set $R(\mathcal{L}(G), Q')$ is computable in polynomial time.

Proof. Given a set of states $S \subseteq Q$, we define $\mathrm{Reach}(S)$ as the set of states reachable from $S$ in $\mathcal{A}_D$. We can divide $Q'$ into strongly connected components (SCCs). We say that an SCC $C$ is recurrent if $\mathcal{A}_D$ can stay in $C$ upon reading any word from $\mathcal{L}(G)$.

We claim that $R(\mathcal{L}(G), Q')$ is the set of states $Q^*$ reachable from all recurrent SCCs in $Q'$. Clearly, $Q^*$ is closed under reachability.

• We will first argue that $Q^* \subseteq R(\mathcal{L}(G), Q')$. First, for every recurrent SCC $C$ and every word $w \in \mathcal{L}(G)$, there is a state $s' \in C$ such that $R(w, s') \cap C \ne \emptyset$. Therefore, $C \subseteq R(w, s')$. By Remark 21 it follows that $C \subseteq R(\mathcal{L}(G), Q')$ and $Q^* \subseteq R(\mathcal{L}(G), Q')$.

• We will next argue that $R(\mathcal{L}(G), Q') \subseteq Q^*$. Observe that for every state $s$ in a non-recurrent SCC $C$ there exists a word $w_s \in \mathcal{L}(G)$ that forces $\mathcal{A}_D$ to leave $C$, i.e., $R(w_s, s) \cap C = \emptyset$. Thus, $|R(w_s, C) \cap C| < |C|$. It follows that we can remove states from $R(w_s, C) \cap C$ one by one by concatenating words $w_s$ to obtain a word $w_C$ such that $R(w_C, C) \cap C = \emptyset$. Since $\mathcal{L}(G)$ is closed under concatenation, the word $w_C$ belongs to $\mathcal{L}(G)$.

Let $C_1, C_2, \ldots, C_\ell$ be the SCCs in $Q'$ not in $Q^*$ (and thus non-recurrent), ordered topologically. Let the word $w_t$ be $w_t = w_{C_1} w_{C_2} \ldots w_{C_\ell}$. Observe that $w_t \in \mathcal{L}(G)$. We have $R(\mathcal{L}(G), Q') \subseteq R(w_t, Q')$ by Remark 21 and we argue that $R(w_t, Q') \subseteq Q^*$.

Any run starting in $Q^*$ will end in $Q^*$, since $Q^*$ is closed under reachability. Observe that $R(w_{C_1}, C_1 \cup \ldots \cup C_\ell)$ does not contain $C_1$, as $R(w_{C_1}, C_1) \cap C_1 = \emptyset$ and, due to the topological order, $C_1$ is not reachable from $C_2, \ldots, C_\ell$. Thus, by induction, $R(w_{C_1}, R(w_{C_2}, \ldots, R(w_{C_\ell}, C_\ell) \ldots))$ does not contain $C_1, \ldots, C_\ell$. Observe that $R(w_t, C_1 \cup \ldots \cup C_\ell) \subseteq R(w_{C_1}, R(w_{C_2}, \ldots, R(w_{C_\ell}, C_\ell) \ldots))$, and hence $R(w_t, C_1 \cup \ldots \cup C_\ell) \subseteq Q^*$.

Given a SCC C and a state s £ G, let the automaton A ( )[f be Ad restricted to G and with start state s. Observe 
that a SCC G is recurrent if and only if there is a state s £ C such that C(G) C C{A ( r f). We can then easily test if a 
SCC is recurrent by trying each possibility for s £ C and testing if C[G) C C(A ( )f ). This can be done in polynomial 
time since language inclusion of a CFG in a DFA can be tested in polynomial time. 

Thus our algorithm is as follows: Compute the set $\{C_1, \ldots, C_\ell\}$ of SCCs in $Q'$. For each $i$, test if $C_i$ is
recurrent, and let $\{C'_1, \ldots, C'_{\ell'}\}$ be the recurrent SCCs in $Q'$. Return $\bigcup_{i=1}^{\ell'} \mathrm{Reach}(C'_i)$. □
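
The algorithm above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the DFA is reduced to its successor graph, the names `reach`, `sccs` and `recurrent_closure` are hypothetical, and the recurrence test (the CFG-in-DFA inclusion check) is passed in as an assumed oracle `is_recurrent` rather than implemented.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, List, Set

State = Hashable
Graph = Dict[State, Set[State]]  # successor relation of the DFA, letters forgotten


def reach(graph: Graph, sources: Iterable[State]) -> Set[State]:
    """All states reachable from `sources` (including the sources themselves)."""
    seen = set(sources)
    stack = list(seen)
    while stack:
        u = stack.pop()
        for v in graph.get(u, set()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def sccs(graph: Graph, states: Iterable[State]) -> List[FrozenSet[State]]:
    """SCCs of the subgraph induced by `states` (simple mutual-reachability method)."""
    inside = set(states)
    sub = {u: graph.get(u, set()) & inside for u in inside}
    forward = {u: reach(sub, [u]) for u in inside}
    components: List[FrozenSet[State]] = []
    assigned: Set[State] = set()
    for u in inside:
        if u in assigned:
            continue
        # v is in the SCC of u iff u reaches v and v reaches u
        component = frozenset(v for v in forward[u] if u in forward[v])
        components.append(component)
        assigned |= component
    return components


def recurrent_closure(
    graph: Graph,
    q_prime: Iterable[State],
    is_recurrent: Callable[[FrozenSet[State]], bool],  # assumed oracle
) -> Set[State]:
    """Union of Reach(C) over the SCCs C of q_prime that the oracle calls recurrent."""
    result: Set[State] = set()
    for component in sccs(graph, q_prime):
        if is_recurrent(component):
            result |= reach(graph, component)
    return result
```

The quadratic SCC computation keeps the sketch short; a linear-time algorithm such as Tarjan's could be substituted without changing the result.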

We next obtain the desired corollary.

Corollary 27. Given a CFG $G$, a non-terminal $A$, a direction $D$, a DFA $\mathcal{A}_D$ and a set of states $Q'$ of $\mathcal{A}_D$ closed
under reachability, the set $R(\mathcal{L}(G, A, D), Q')$ is computable in polynomial time.

Proof. The proof follows from Lemma 16 and Lemma 26, using that $\mathcal{L}(G, A, D)$ is closed under concatenation. □

Lemma 28. Given a context-free grammar $G$ and a safety DFA $\mathcal{A}_D$, the algorithm InfEdsSeq can be implemented in
coNP and correctly decides whether $ed(\mathcal{L}(\mathcal{A}_P), \mathcal{L}(\mathcal{A}_D))$ is finite. Moreover, if $\mathcal{A}_D$ is of constant size, then
InfEdsSeq does not need non-determinism (and thus uses polynomial time only).

Proof. The correctness comes from Lemma 23. The complexity comes from Lemma 24, Lemma 25 and Corollary 27.
Note that if the DFA is of constant size, then $k$ is bounded by a constant, according to Lemma 24, and thus there
are only polynomially many candidates for $s$; hence all of them can be checked in polynomial time in total. □

Theorem 29. For $\mathcal{C}_1 \in \{\mathrm{DPDA}, \mathrm{PDA}\}$, the problem FED($\mathcal{C}_1$, DFA) is coNP-complete.

Proof. First, we discuss containment of FED(PDA, DFA) in coNP. Consider a PDA $\mathcal{A}_P$ and a DFA $\mathcal{A}_D$. We can
transform $\mathcal{A}_P$ to a context-free grammar $G$ with $\mathcal{L}(\mathcal{A}_P) = \mathcal{L}(G)$ in polynomial time. Also, we can transform $\mathcal{A}_D$
to a safety DFA $\mathcal{A}'_D$ recognizing the language $\mathrm{prefix}(\mathcal{L}(\mathcal{A}_D))$. Due to Lemma ITdl we have $ed(\mathcal{L}(G), \mathcal{A}_D)$ is finite if
and only if $ed(\mathcal{L}(G), \mathcal{A}'_D)$ is finite. By Lemma 28 we can decide whether $ed(\mathcal{L}(G), \mathcal{A}'_D)$ is finite in coNP. Hence,
FED(PDA, DFA) and FED(DPDA, DFA) are in coNP.

It has been shown in [3] that FED(DFA, DFA) is coNP-hard; therefore FED(PDA, DFA) and FED(DPDA, DFA)
are coNP-hard. □
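
The transformation of a DFA into a safety DFA for the prefix language, used in the containment argument, can be illustrated as follows. This is a sketch under assumptions: a safety DFA is modelled here as a partial DFA in which every remaining state is accepting and a missing transition means rejection, and the names `prefix_safety_dfa` and `accepts` are hypothetical.

```python
from typing import Dict, Hashable, Optional, Set, Tuple

State = Hashable
Delta = Dict[Tuple[State, str], State]  # (state, letter) -> successor


def prefix_safety_dfa(
    delta: Delta, start: State, accepting: Set[State]
) -> Tuple[Delta, Optional[State], Set[State]]:
    """Keep only states from which an accepting state is reachable; all are accepting."""
    predecessors: Dict[State, Set[State]] = {}
    for (p, _letter), q in delta.items():
        predecessors.setdefault(q, set()).add(p)

    # Backward search from the accepting states.
    live = set(accepting)
    stack = list(live)
    while stack:
        q = stack.pop()
        for p in predecessors.get(q, set()):
            if p not in live:
                live.add(p)
                stack.append(p)

    pruned = {(p, a): q for (p, a), q in delta.items() if p in live and q in live}
    new_start = start if start in live else None  # None: the language is empty
    return pruned, new_start, live


def accepts(delta: Delta, start: Optional[State], accepting: Set[State], word: str) -> bool:
    """Run a (possibly partial) DFA; a missing transition rejects."""
    state = start
    if state is None:
        return False
    for letter in word:
        state = delta.get((state, letter))
        if state is None:
            return False
    return state in accepting
```

A word is accepted by the pruned automaton exactly when it can be extended to a word of the original language, which is the prefix language the proof refers to.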

4.3 Lower bound 

We have shown an exponential upper bound on the edit distance whenever it is finite. As mentioned in the
introduction, it is easy to define a family of context-free grammars that accept only a single word of exponential
length, using repeated doubling; thus the edit distance between DPDAs and DFAs can be exponential. We can also
show that the inclusion problem reduces to the finite edit distance problem FED(DPDA, NFA), which gives the
following lemma.
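
Repeated doubling can be made concrete with a small sketch. The grammar below is an assumed instance of the idea, not necessarily the family from the introduction: it has rules $A_i \to A_{i-1} A_{i-1}$ and $A_0 \to a$, so $n + 1$ rules derive exactly one word, of length $2^n$. The function names are hypothetical.

```python
from typing import Dict, List

Grammar = Dict[str, List[str]]  # one rule per non-terminal: head -> body


def doubling_grammar(n: int) -> Grammar:
    """Grammar with n + 1 rules whose start symbol A{n} derives only a^(2^n)."""
    rules: Grammar = {"A0": ["a"]}
    for i in range(1, n + 1):
        rules[f"A{i}"] = [f"A{i - 1}", f"A{i - 1}"]
    return rules


def word_length(rules: Grammar, symbol: str) -> int:
    """Length of the unique word derived from `symbol`, without expanding it."""
    cache: Dict[str, int] = {}

    def length(x: str) -> int:
        if x not in rules:  # terminal letter
            return 1
        if x not in cache:
            cache[x] = sum(length(y) for y in rules[x])
        return cache[x]

    return length(symbol)


def expand(rules: Grammar, symbol: str) -> str:
    """Fully expand `symbol`; only sensible for small n."""
    if symbol not in rules:
        return symbol
    return "".join(expand(rules, y) for y in rules[symbol])
```

For example, `word_length(doubling_grammar(10), "A10")` evaluates to 1024 although the grammar has only 11 rules.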

Lemma 30. FED(DPDA, NFA) is ExpTime-hard.

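Proof. We show that the inclusion problem of DPDA in NFA, which is ExpTime-hard by Lemma 8, reduces
to FED(DPDA, NFA). Consider a DPDA $\mathcal{A}_P$ and an NFA $\mathcal{A}_N$. We define
$\widehat{\mathcal{L}} = \{ \ldots \# w_k \# : k \in \mathbb{N},\ w_1, \ldots, w_k \in \mathcal{L} \}$. Observe that either $\widehat{\mathcal{L}}_1 \subseteq \widehat{\mathcal{L}}_2$ or
$ed(\widehat{\mathcal{L}}_1, \widehat{\mathcal{L}}_2) = \infty$. Therefore, $ed(\widehat{\mathcal{L}}_1, \widehat{\mathcal{L}}_2) < \infty$ if and only if $\mathcal{L}_1 \subseteq \mathcal{L}_2$. In particular,
$\mathcal{L}(\mathcal{A}_P) \subseteq \mathcal{L}(\mathcal{A}_N)$ if and only if $ed(\widehat{\mathcal{L}(\mathcal{A}_P)}, \widehat{\mathcal{L}(\mathcal{A}_N)}) < \infty$. Observe that in polynomial time we can
transform $\mathcal{A}_P$ (resp., $\mathcal{A}_N$) to a DPDA $\widehat{\mathcal{A}}_P$ (resp., an NFA $\widehat{\mathcal{A}}_N$) recognizing $\widehat{\mathcal{L}(\mathcal{A}_P)}$ (resp., $\widehat{\mathcal{L}(\mathcal{A}_N)}$).
It suffices to add transitions from all final states to all initial states with the letter $\#$, i.e.,
$\{(q, \#, s) : q \in F, s \in S\}$ for NFA (resp., $\{(q, \#, \bot, s) : q \in F, s \in S\}$ for DPDA). For DPDA the additional
transitions are possible only with empty stack. □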
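
For the NFA case, the added transitions can be sketched as below. This is an illustration under assumptions: the NFA is a set of triples `(state, letter, state)`, the accepting states are left unchanged, and the names `add_hash_transitions` and `nfa_accepts` are hypothetical. The DPDA case, where the new transitions also require an empty stack, is not covered.

```python
from typing import Hashable, Iterable, Set, Tuple

State = Hashable
Transition = Tuple[State, str, State]
HASH = "#"  # separator letter, assumed not to occur in the original alphabet


def add_hash_transitions(
    transitions: Iterable[Transition],
    initial: Iterable[State],
    final: Iterable[State],
) -> Set[Transition]:
    """Add {(q, #, s) : q in F, s in S} to the transition relation."""
    extended = set(transitions)
    extended |= {(q, HASH, s) for q in final for s in initial}
    return extended


def nfa_accepts(
    transitions: Iterable[Transition],
    initial: Iterable[State],
    final: Iterable[State],
    word: str,
) -> bool:
    """Standard subset simulation of an NFA without epsilon transitions."""
    relation = set(transitions)
    current = set(initial)
    for letter in word:
        current = {t for (p, x, t) in relation if p in current and x == letter}
    return bool(current & set(final))
```

With the accepting states unchanged, the extended automaton accepts `#`-separated sequences of words of the original language; each `#` can only be read in a final state and restarts the automaton in an initial state.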


5 Edit distance to PDA 

Observe that the threshold distance problem from DFA to PDA with the fixed threshold 0 and a fixed DFA recognizing
$\Sigma^*$ coincides with the universality problem for PDA. Hence, the universality problem for PDA, which is undecidable,
reduces to TED(DFA, PDA). The universality problem for PDA reduces to FED(DFA, PDA) as well, by the same
argument as in Lemma 30. Finally, we can reduce the inclusion problem of DPDA in DPDA, which is undecidable,
to TED(DPDA, DPDA) (resp., FED(DPDA, DPDA)). Again, we can use the same construction as in Lemma 30. In
conclusion, we have the following proposition.

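Proposition 31. (1) For every class $\mathcal{C} \in \{\mathrm{DFA}, \mathrm{NFA}, \mathrm{DPDA}, \mathrm{PDA}\}$, the problems TED($\mathcal{C}$, PDA) and FED($\mathcal{C}$, PDA)
are undecidable. (2) For every class $\mathcal{C} \in \{\mathrm{DPDA}, \mathrm{PDA}\}$, the problem FED($\mathcal{C}$, DPDA) is undecidable.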
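
The reduction from universality to TED(DFA, PDA) with threshold 0 can be written out as a chain of equivalences. This is a sketch assuming the definition from the introduction, where the edit distance between two languages is the supremum, over words of the first language, of the distance to the second; $\mathcal{A}_P$ stands for an arbitrary PDA over $\Sigma$.

```latex
\begin{align*}
ed(\Sigma^*, \mathcal{L}(\mathcal{A}_P)) \le 0
  &\iff \sup_{w \in \Sigma^*} \, \inf_{v \in \mathcal{L}(\mathcal{A}_P)} ed(w, v) = 0 \\
  &\iff \forall w \in \Sigma^* \ \exists v \in \mathcal{L}(\mathcal{A}_P) : \ ed(w, v) = 0 \\
  &\iff \Sigma^* \subseteq \mathcal{L}(\mathcal{A}_P)
  \iff \mathcal{L}(\mathcal{A}_P) = \Sigma^* .
\end{align*}
```

The second step uses that edit distances between words are non-negative integers, so the infimum is 0 only if it is attained, and $ed(w, v) = 0$ holds only for $w = v$.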

The results in (1) of Proposition 31 are obtained by reduction from the universality problem for PDA. However,
the universality problem for DPDA is decidable. Still, we show that TED(DFA, DPDA) is undecidable. The overall
argument is similar to the one in Section 3.2. First, we define nearly-deterministic PDA, a pushdown counterpart of
nearly-deterministic NFA.

Definition 32. A PDA $\mathcal{A} = (\Sigma, \Gamma, Q, S, \delta, F)$ is nearly-deterministic if $|S| = 1$ and $\delta = \delta_1 \cup \delta_2$, where $\delta_1$ is a
function and, for every accepting run, the automaton takes a transition from $\delta_2$ exactly once.

By carefully reviewing the standard reduction of the halting problem for Turing machines to the universality
problem for pushdown automata [14], we observe that the PDA produced by the reduction are
nearly-deterministic.

Lemma 33. The problem of deciding, given a nearly-deterministic PDA $\mathcal{A}_P$, whether $\mathcal{L}(\mathcal{A}_P) = \Sigma^*$ is undecidable.

Using the same construction as in Lemma 10, we show a reduction of the universality problem for
nearly-deterministic PDA to TED(DFA, DPDA).

Proposition 34. For every class $\mathcal{C} \in \{\mathrm{DFA}, \mathrm{NFA}, \mathrm{DPDA}, \mathrm{PDA}\}$, the problem TED($\mathcal{C}$, DPDA) is undecidable.

Proof. We show that TED(DFA, DPDA) (resp., FED(DFA, PDA)) is undecidable, as it implies undecidability of the
rest of the problems. The same construction as in the proof of Lemma 10 shows a reduction of the universality problem
for nearly-deterministic PDA, which is undecidable by Lemma 33, to TED(DFA, DPDA). □

We presented the complete decidability picture for the problems TED($\mathcal{C}_1$, $\mathcal{C}_2$), for $\mathcal{C}_1 \in \{\mathrm{DFA}, \mathrm{NFA}, \mathrm{DPDA}, \mathrm{PDA}\}$
and $\mathcal{C}_2 \in \{\mathrm{DPDA}, \mathrm{PDA}\}$. To complete the characterization of the problems FED($\mathcal{C}_1$, $\mathcal{C}_2$) with respect to their
decidability, we still need to settle the decidability (and complexity) status of FED(DFA, DPDA). We leave it as an
open problem, but conjecture that it is coNP-complete.

Conjecture 35. FED(DFA, DPDA) is coNP-complete. 

6 Conclusions 

In this work we consider the edit distance problem for PDA and its subclasses and present a complete decidability 
and complexity picture for the TED problem. We leave some open conjectures about the parametrized complexity 
of the TED problem, and the complexity of FED problem when the target is a DPDA. Moreover, one can study the 
edit distance for other classes of languages between regular languages and context-free languages such as visibly 
pushdown automata. 

While in this work we count the number of edit operations, a different notion is to measure the average number 
of edit operations. The average-based measure is undecidable in many cases even for finite automata, and in cases 
when it is decidable reduces to mean-payoff games on graphs ffl. Since mean-payoff games on pushdown graphs are 
undecidable [10], most of the problems related to the edit distance question for average measure for DPDA and PDA
are likely to be undecidable. 